Test 2024 11 19 #3679

Closed
wants to merge 17 commits into from
9 changes: 7 additions & 2 deletions .wordlist.txt
@@ -72,7 +72,7 @@ iGPU
inlined
inplace
interop
Interoperation
interoperation
interoperate
Interprocess
interprocess
@@ -113,6 +113,8 @@ omnitrace
overindex
overindexing
oversubscription
overutilized
parallelizable
pixelated
pragmas
preallocated
@@ -150,11 +152,14 @@ texels
tradeoffs
templated
toolkits
transfering
typedefs
unintuitive
UMM
unmap
unmapped
unmapping
upscaled
variadic
WinGDB
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
zc
126 changes: 126 additions & 0 deletions docs/how-to/hip_runtime_api/call_stack.rst
@@ -0,0 +1,126 @@
.. meta::
   :description: This page describes the call stack concept in HIP
   :keywords: AMD, ROCm, HIP, call stack

*******************************************************************************
Call stack
*******************************************************************************

The call stack is a data structure that manages function calls by saving the
state of the current function. Each time a function is called, a new call frame
is pushed onto the top of the stack, containing information such as local
variables, return addresses, and function parameters. When the function
finishes executing, its frame is popped from the stack and the saved state is
restored to the corresponding registers. This allows the program to return to
the calling function and continue execution where it left off.

The call stack for each thread must track its function calls, local variables,
and return addresses. In GPU programming, however, the memory required to store
the call stacks grows with the massive parallelism inherent to GPUs. NVIDIA and
AMD GPUs take different approaches. NVIDIA GPUs provide independent thread
scheduling, where each thread has its own call stack and effective program
counter. On AMD GPUs, threads are grouped: each warp has its own call stack and
program counter, rather than each individual thread. Warps are described in
:ref:`inherent_thread_hierarchy`.

If a thread or warp exceeds its stack size, a stack overflow occurs, causing
the kernel to fail. Stack overflows can be detected with a debugger.
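To get a feel for why the stack can consume so much device memory, a
back-of-the-envelope estimate (with hypothetical device figures, not taken from
this page) multiplies the per-thread stack size by the number of threads that
can be resident at once:

.. code-block:: cpp

   #include <cassert>
   #include <cstdio>

   int main()
   {
       // Hypothetical figures for illustration only.
       const unsigned long long stackPerThread = 1024; // bytes, see hipLimitStackSize
       const unsigned long long threadsPerCU   = 2048; // max resident threads per compute unit
       const unsigned long long computeUnits   = 60;

       // Worst case: every resident thread needs its own stack backing store.
       const unsigned long long totalBytes = stackPerThread * threadsPerCU * computeUnits;
       printf("Worst-case stack memory: %llu MiB\n", totalBytes / (1024 * 1024));

       assert(totalBytes == 120ull * 1024 * 1024); // 120 MiB for only 1 KiB per thread
       return 0;
   }

Even a modest 1 KiB per-thread stack adds up to over a hundred mebibytes of
backing memory on a fully occupied device, which is why the stack size limit is
worth tuning.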

Call stack management with HIP
===============================================================================

You can adjust the call stack size as shown in the following example, allowing
fine-tuning based on specific kernel requirements. This helps prevent stack
overflow errors by ensuring sufficient stack memory is allocated.

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <iostream>

   #define HIP_CHECK(expression)                  \
   {                                              \
       const hipError_t status = expression;      \
       if(status != hipSuccess){                  \
           std::cerr << "HIP error "              \
                     << status << ": "            \
                     << hipGetErrorString(status) \
                     << " at " << __FILE__ << ":" \
                     << __LINE__ << std::endl;    \
       }                                          \
   }

   int main()
   {
       size_t stackSize;
       HIP_CHECK(hipDeviceGetLimit(&stackSize, hipLimitStackSize));
       std::cout << "Default stack size: " << stackSize << " bytes" << std::endl;

       // Set a new stack size
       size_t newStackSize = 1024 * 8; // 8 KiB
       HIP_CHECK(hipDeviceSetLimit(hipLimitStackSize, newStackSize));

       HIP_CHECK(hipDeviceGetLimit(&stackSize, hipLimitStackSize));
       std::cout << "Updated stack size: " << stackSize << " bytes" << std::endl;

       return 0;
   }

Handling recursion and deep function calls
-------------------------------------------------------------------------------

As in CPU programming, recursive functions and deeply nested function calls are
supported. However, developers must ensure that these calls do not exceed the
available stack memory, keeping in mind the large amount of memory the call
stacks require due to the GPU's inherent parallelism. This can be achieved by
increasing the stack size, implementing error handling to catch stack
overflows, optimizing code to reduce stack usage, and using profiling tools to
monitor stack memory. Proper kernel design and memory management strategies are
essential for efficient and stable application performance.

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <iostream>

   #define HIP_CHECK(expression)                  \
   {                                              \
       const hipError_t status = expression;      \
       if(status != hipSuccess){                  \
           std::cerr << "HIP error "              \
                     << status << ": "            \
                     << hipGetErrorString(status) \
                     << " at " << __FILE__ << ":" \
                     << __LINE__ << std::endl;    \
       }                                          \
   }

   __device__ unsigned long long fibonacci(unsigned long long n)
   {
       if (n == 0 || n == 1)
       {
           return n;
       }
       return fibonacci(n - 1) + fibonacci(n - 2);
   }

   __global__ void kernel(unsigned long long n)
   {
       unsigned long long result = fibonacci(n);
       const size_t x = threadIdx.x + blockDim.x * blockIdx.x;

       if (x == 0)
           printf("fibonacci(%llu) = %llu\n", n, result);
   }

   int main()
   {
       kernel<<<1, 1>>>(10);
       HIP_CHECK(hipDeviceSynchronize());

       // With the -O0 optimization option, the following launch hits the
       // stack limit:
       // kernel<<<1, 256>>>(2048);
       // HIP_CHECK(hipDeviceSynchronize());

       return 0;
   }
116 changes: 116 additions & 0 deletions docs/how-to/hip_runtime_api/external_interop.rst
@@ -0,0 +1,116 @@
.. meta::
   :description: HIP provides an external resource interoperability API that
                 allows efficient data sharing between HIP and graphics APIs
                 such as Vulkan.
   :keywords: AMD, ROCm, HIP, external, interop, interoperability

*******************************************************************************
External resource interoperability
*******************************************************************************

This feature allows HIP to work with resources -- such as memory and
semaphores -- created by other APIs. Resources created by APIs such as CUDA,
OpenCL, and Vulkan can be used within HIP, making it easier to integrate HIP
into existing projects.

To use external resources in HIP, you typically follow these steps:

- Import resources from other APIs using HIP provided functions
- Use external resources as if they were created in HIP
- Destroy the HIP resource object to clean up
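
A minimal sketch of these steps for an external memory object, assuming the
exporting API provided an opaque POSIX file descriptor (``memFd`` and
``memSize`` are placeholders; error checking omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Sketch only: memFd and memSize are placeholders for a handle and size
   // obtained from the exporting API (Vulkan, for example).
   void* importExternalBuffer(int memFd, size_t memSize, hipExternalMemory_t* extMem)
   {
       // 1. Import the resource using a HIP provided function.
       hipExternalMemoryHandleDesc memDesc = {};
       memDesc.type      = hipExternalMemoryHandleTypeOpaqueFd;
       memDesc.handle.fd = memFd;
       memDesc.size      = memSize;
       hipImportExternalMemory(extMem, &memDesc);

       // 2. Map the memory so it can be used as if it were created in HIP.
       hipExternalMemoryBufferDesc bufDesc = {};
       bufDesc.offset = 0;
       bufDesc.size   = memSize;
       void* devPtr   = nullptr;
       hipExternalMemoryGetMappedBuffer(&devPtr, *extMem, &bufDesc);

       // 3. When finished, release the mapping and the imported object with
       //    hipFree(devPtr) and hipDestroyExternalMemory(*extMem).
       return devPtr;
   }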

Semaphore functions
===============================================================================

Semaphore functions are essential for synchronization in parallel computing.
They facilitate communication and coordination between different parts of a
program or between different programs. By managing semaphores, tasks are
executed in the correct order and resources are used effectively, preventing
conflicts and preserving the integrity and performance of concurrent processes.

External semaphore functions can be used in HIP as described in :doc:`../reference/hip_runtime_api/external_interop`.
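
The typical flow imports the semaphore, then waits on or signals it
asynchronously on a stream. A sketch of that flow, assuming an opaque POSIX
file descriptor ``semFd`` exported by the other API (error checking omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   void waitComputeSignal(int semFd, hipStream_t stream)
   {
       // Import a semaphore created by another API (semFd is a placeholder
       // for an exported opaque file descriptor).
       hipExternalSemaphore_t extSem;
       hipExternalSemaphoreHandleDesc semDesc = {};
       semDesc.type      = hipExternalSemaphoreHandleTypeOpaqueFd;
       semDesc.handle.fd = semFd;
       hipImportExternalSemaphore(&extSem, &semDesc);

       // Wait until the other API signals the semaphore...
       hipExternalSemaphoreWaitParams waitParams = {};
       hipWaitExternalSemaphoresAsync(&extSem, &waitParams, 1, stream);

       // ...launch kernels on `stream` here, then signal back.
       hipExternalSemaphoreSignalParams signalParams = {};
       hipSignalExternalSemaphoresAsync(&extSem, &signalParams, 1, stream);

       // Clean up once the semaphore is no longer needed.
       hipDestroyExternalSemaphore(extSem);
   }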

Memory functions
===============================================================================

HIP external memory functions focus on the efficient sharing and management of
memory resources. They make it possible to import memory created by external
systems so that a HIP program can use it seamlessly. Memory functions include
mapping memory for effective use and ensuring proper cleanup to prevent
resource leaks. This is critical for performance, particularly in applications
handling large datasets or complex structures such as textures in graphics.
Proper memory management ensures stability and efficient resource utilization.

Example
===============================================================================

The ROCm examples include a
`HIP-Vulkan interoperation example <https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/vulkan_interop>`_
that demonstrates how to perform interoperation between HIP and Vulkan.

In this example, a simple HIP kernel is used to compute a sine wave, which is
then rendered to a window as a graphical output using Vulkan. The process
requires several initialization steps, such as setting up a HIP context,
creating a Vulkan instance, and configuring the GPU device and queue. After
these initial steps, the kernel executes the sine wave computation, and Vulkan
continuously updates the window framebuffer to display the computed data until
the window is closed.

The following code converts a Vulkan memory handle to its equivalent HIP
handle. The input ``VkDeviceMemory`` and the created HIP memory represent the
same physical area of GPU memory through the handles of each respective API.
Writing to the buffer in one API allows the results to be read through the
other. Note that access to the buffer must be synchronized between the APIs,
for example using queue synchronization or semaphores.

.. literalinclude:: ../../tools/example_codes/external_interop.hip
   :start-after: // [Sphinx vulkan memory to hip start]
   :end-before: // [Sphinx vulkan memory to hip end]
   :language: cpp

The Vulkan semaphore is converted to a HIP semaphore as shown in the following
example. Signaling the semaphore in one API allows the other API to wait on it,
which is how synchronized access to resources is guaranteed across APIs.

.. literalinclude:: ../../tools/example_codes/external_interop.hip
   :start-after: // [Sphinx semaphore convert start]
   :end-before: // [Sphinx semaphore convert end]
   :language: cpp

When the external memory is exported from Vulkan and imported to HIP, it is not
yet ready for use. The Vulkan handle is shared, allowing for memory sharing
rather than copying during the export process. To use the memory, it must be
mapped to a pointer that can be passed to the kernel for reading and writing.
The following example shows how to map the external memory to HIP:

.. literalinclude:: ../../tools/example_codes/external_interop.hip
   :start-after: // [Sphinx map external memory start]
   :end-before: // [Sphinx map external memory end]
   :language: cpp

Wait until the buffer is ready and is no longer being modified on the Vulkan
side:

.. literalinclude:: ../../tools/example_codes/external_interop.hip
   :start-after: // [Sphinx wait semaphore start]
   :end-before: // [Sphinx wait semaphore end]
   :language: cpp

The sinewave kernel implementation:

.. literalinclude:: ../../tools/example_codes/external_interop.hip
   :start-after: // [Sphinx sinewave kernel start]
   :end-before: // [Sphinx sinewave kernel end]
   :language: cpp

Signal to Vulkan that we are done with the buffer and that it can proceed with
rendering:

.. literalinclude:: ../../tools/example_codes/external_interop.hip
   :start-after: // [Sphinx signal semaphore start]
   :end-before: // [Sphinx signal semaphore end]
   :language: cpp