Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add initialization and error handling #3632

Merged
merged 1 commit into from
Nov 18, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
134 changes: 134 additions & 0 deletions docs/how-to/hip_runtime_api/error_handling.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,134 @@
.. meta::
:description: Error Handling
:keywords: AMD, ROCm, HIP, error handling, error

********************************************************************************
Error handling
********************************************************************************

HIP provides functionality to detect, report, and manage errors that occur
neon60 marked this conversation as resolved.
Show resolved Hide resolved
during the execution of HIP runtime functions or when launching kernels. Every
HIP runtime function, apart from launching kernels, has :cpp:type:`hipError_t`
as return type. :cpp:func:`hipGetLastError()` and :cpp:func:`hipPeekAtLastError()`
can be used for catching errors from kernel launches, as kernel launches don't
return an error directly. HIP maintains an internal state, that includes the
last error code. :cpp:func:`hipGetLastError` returns and resets that error to
hipSuccess, while :cpp:func:`hipPeekAtLastError` just returns the error without
changing it. To get a human readable version of the errors,
:cpp:func:`hipGetErrorString()` and :cpp:func:`hipGetErrorName()` can be used.

.. note::

:cpp:func:`hipGetLastError` returns the returned error code of the last HIP
runtime API call even if it's hipSuccess, while ``cudaGetLastError`` returns
the error returned by any of the preceding CUDA APIs in the same host thread.
:cpp:func:`hipGetLastError` behavior will be matched with
``cudaGetLastError`` in ROCm release 7.0.

satyanveshd marked this conversation as resolved.
Show resolved Hide resolved
Best practices of HIP error handling:

1. Check errors after each API call - Avoid error propagation.
2. Use macros for error checking - Check :ref:`hip_check_macros`.
3. Handle errors gracefully - Free resources and provide meaningful error
messages to the user.

For more details on the error handling functions, see :ref:`error handling
functions reference page <error_handling_reference>`.

.. _hip_check_macros:

HIP check macros
================================================================================

HIP uses check macros to simplify error checking and reduce code duplication.
The ``HIP_CHECK`` macros are mainly used to detect and report errors. It can
also exit from application with ``exit(1);`` function call after the error
print. The ``HIP_CHECK`` macro example:

.. code-block:: cpp

#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}

Complete example
================================================================================

A complete example to demonstrate the error handling with a simple addition of
two values kernel:

.. code-block:: cpp

#include <hip/hip_runtime.h>
#include <vector>
#include <iostream>

#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}

// Addition of two values.
__global__ void add(int *a, int *b, int *c, size_t size) {
const size_t index = threadIdx.x + blockDim.x * blockIdx.x;
if(index < size) {
c[index] += a[index] + b[index];
}
}

int main() {
constexpr int numOfBlocks = 256;
constexpr int threadsPerBlock = 256;
constexpr size_t arraySize = 1U << 16;

std::vector<int> a(arraySize), b(arraySize), c(arraySize);
int *d_a, *d_b, *d_c;

// Setup input values.
std::fill(a.begin(), a.end(), 1);
std::fill(b.begin(), b.end(), 2);

// Allocate device copies of a, b and c.
HIP_CHECK(hipMalloc(&d_a, arraySize * sizeof(*d_a)));
HIP_CHECK(hipMalloc(&d_b, arraySize * sizeof(*d_b)));
HIP_CHECK(hipMalloc(&d_c, arraySize * sizeof(*d_c)));

// Copy input values to device.
HIP_CHECK(hipMemcpy(d_a, &a, arraySize * sizeof(*d_a), hipMemcpyHostToDevice));
HIP_CHECK(hipMemcpy(d_b, &b, arraySize * sizeof(*d_b), hipMemcpyHostToDevice));

// Launch add() kernel on GPU.
hipLaunchKernelGGL(add, dim3(numOfBlocks), dim3(threadsPerBlock), 0, 0, d_a, d_b, d_c, arraySize);
// Check the kernel launch
HIP_CHECK(hipGetLastError());
// Check for kernel execution error
HIP_CHECK(hipDeviceSynchronize());

// Copy the result back to the host.
HIP_CHECK(hipMemcpy(&c, d_c, arraySize * sizeof(*d_c), hipMemcpyDeviceToHost));

// Cleanup allocated memory.
HIP_CHECK(hipFree(d_a));
HIP_CHECK(hipFree(d_b));
HIP_CHECK(hipFree(d_c));

// Print the result.
std::cout << a[0] << " + " << b[0] << " = " << c[0] << std::endl;

return 0;
}
105 changes: 105 additions & 0 deletions docs/how-to/hip_runtime_api/initialization.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
.. meta::
:description: Initialization.
:keywords: AMD, ROCm, HIP, initialization

********************************************************************************
Initialization
********************************************************************************

The initialization involves setting up the environment and resources needed for
using GPUs. The following steps are covered with the initialization:

- Setting up the HIP runtime

This includes reading the environment variables set during init, setting up
the active or visible devices, loading necessary libraries, setting up
internal buffers for memory copies or cooperative launches, initialize the
compiler as well as HSA runtime and checks any errors due to lack of resources
or no active devices.

- Querying and setting GPUs

Identifying and querying the available GPU devices on the system.

- Setting up contexts
neon60 marked this conversation as resolved.
Show resolved Hide resolved

Creating contexts for each GPU device, which are essential for managing
neon60 marked this conversation as resolved.
Show resolved Hide resolved
resources and executing kernels. For further details, check the :ref:`context
section <context_driver_api>`.

Initialize the HIP runtime
neon60 marked this conversation as resolved.
Show resolved Hide resolved
================================================================================

The HIP runtime is initialized automatically when the first HIP API call is
made. However, you can explicitly initialize it using :cpp:func:`hipInit`,
to be able to control the timing of the initialization. The manual
initialization can be useful to ensure that the GPU is initialized and
ready, or to isolate GPU initialization time from other parts of
your program.

satyanveshd marked this conversation as resolved.
Show resolved Hide resolved
.. note::

You can use :cpp:func:`hipDeviceReset()` to delete all streams created, memory
allocated, kernels running and events created by the current process. Any new
HIP API call initializes the HIP runtime again.

Querying and setting GPUs
================================================================================

If multiple GPUs are available in the system, you can query and select the
desired GPU(s) to use based on device properties, such as size of global memory,
size shared memory per block, support of cooperative launch and support of
managed memory.

Querying GPUs
--------------------------------------------------------------------------------

The properties of a GPU can be queried using :cpp:func:`hipGetDeviceProperties`,
which returns a struct of :cpp:struct:`hipDeviceProp_t`. The properties in the struct can be
used to identify a device or give an overview of hardware characteristics, that
might make one GPU better suited for the task than others.

MKKnorr marked this conversation as resolved.
Show resolved Hide resolved
The :cpp:func:`hipGetDeviceCount` function returns the number of available GPUs,
which can be used to loop over the available GPUs.

Example code of querying GPUs:

.. code-block:: cpp

#include <hip/hip_runtime.h>
#include <iostream>

int main() {

int deviceCount;
if (hipGetDeviceCount(&deviceCount) == hipSuccess){
for (int i = 0; i < deviceCount; ++i){
hipDeviceProp_t prop;
if ( hipGetDeviceProperties(&prop, i) == hipSuccess)
std::cout << "Device" << i << prop.name << std::endl;
neon60 marked this conversation as resolved.
Show resolved Hide resolved
}
}

return 0;
}

Setting the GPU
--------------------------------------------------------------------------------

:cpp:func:`hipSetDevice` function select the GPU to be used for subsequent HIP
operations. This function performs several key tasks:

- Context Binding

Binds the current thread to the context of the specified GPU device. This
ensures that all subsequent operations are executed on the selected device.

- Resource Allocation

Prepares the device for resource allocation, such as memory allocation and
stream creation.

- Check device availability

Checks for errors in device selection and returns error if the specified
device is not available or not capable of executing HIP operations.
2 changes: 2 additions & 0 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,9 @@ The HIP documentation is organized into the following categories:
:::{grid-item-card} How to

* {doc}`./how-to/hip_runtime_api`
neon60 marked this conversation as resolved.
Show resolved Hide resolved
* {doc}`./how-to/hip_runtime_api/initialization`
* {doc}`./how-to/hip_runtime_api/memory_management`
* {doc}`./how-to/hip_runtime_api/error_handling`
* {doc}`./how-to/hip_runtime_api/cooperative_groups`
* {doc}`./how-to/hip_runtime_api/hipgraph`
* [HIP porting guide](./how-to/hip_porting_guide)
Expand Down
2 changes: 2 additions & 0 deletions docs/sphinx/_toc.yml.in
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@ subtrees:
- file: how-to/hip_runtime_api
subtrees:
- entries:
- file: how-to/hip_runtime_api/initialization
- file: how-to/hip_runtime_api/memory_management
subtrees:
- entries:
Expand All @@ -46,6 +47,7 @@ subtrees:
- file: how-to/hip_runtime_api/memory_management/unified_memory
- file: how-to/hip_runtime_api/memory_management/virtual_memory
- file: how-to/hip_runtime_api/memory_management/stream_ordered_allocator
- file: how-to/hip_runtime_api/error_handling
- file: how-to/hip_runtime_api/cooperative_groups
- file: how-to/hip_runtime_api/hipgraph
- file: how-to/hip_porting_guide
Expand Down