diff --git a/docs/tutorials/saxpy.md b/docs/tutorials/saxpy.md
index 3dcf354167..1fed2ce118 100644
--- a/docs/tutorials/saxpy.md
+++ b/docs/tutorials/saxpy.md
@@ -37,11 +37,11 @@ cause and design a fix.
 ## Your First Piece of HIP Code
 
 First, let's take the "Hello, World!" of GPGPU: SAXPY. The name comes from
-math, a vector equation at that: `a * x + y = z` where `a ∈ ℝ` is a scalar and
-`x,y,z ∈ 𝕍` are vector quantities of some large dimensionality. (Allow us to
-omit defining what vectors and vector spaces are in the mathematical sense.)
-From a practical perspective we can compute this using a single `for` loop over
-3 arrays.
+math, a vector equation at that: {math}`a\cdot x+y=z` where
+{math}`a\in\mathbb{R}` is a scalar and {math}`x,y,z\in\mathbb{V}` are vector
+quantities of some large dimensionality. (Our vector space is defined over the
+set of reals.) From a practical perspective we can compute this using a single
+`for` loop over 3 arrays.
 
 ```c++
 for (int i = 0 ; i < N ; ++i)
@@ -50,8 +50,8 @@ for (int i = 0 ; i < N ; ++i)
 
 In linear algebra libraries, such as BLAS (Basic Linear Algebra Subsystem) this
 operation is defined as AXPY "A times X Plus Y". The "S" comes from
-single-precision, meaning that every element of our array are `float`s (our
-vector space was defined over the set of reals).
+single-precision, meaning that every element of our array are `float`s. (We
+choose IEEE 754: binary32 arithmetic as the representation of our algebra.)
 
 To get quickly off the ground, we'll take off-the-shelf piece of code, the set
 of [HIP samples from GitHub](https://github.com/amd/rocm-examples/). Assuming
@@ -90,7 +90,7 @@ device is `0`, which is equivalent to calling `hipSetDevice(0)`.
 
 Once our data has been dispatched, we can launch our calculation on the device.
 
-```hip
+```cu
 __global__ void saxpy_kernel(const float a, const float* d_x, float* d_y, const unsigned int size)
 {
     ...
@@ -122,9 +122,9 @@ First let's discuss the signature of the offloaded function:
   constructors. Where would that logic execute? On the host? On the device?)_
   Pointer arguments are pointers to device memory, one typically backed by
   VRAM.
-- We said that we'll be computing `a * x + y = z`, however we only pass two
-  pointers to the function. We'll be canonically reusing one of the inputs as
-  outputs.
+- We said that we'll be computing {math}`a\cdot x+y=z`, however we only pass
+  two pointers to the function. We'll be canonically reusing one of the inputs
+  as outputs.
 
 There's quite a lot to unpack already. How is this function launched from the
 host? Using a language extension, the so-called triple chevron syntax. Inside
@@ -138,7 +138,7 @@ the angle brackets we can provide the following:
 Following the triple chevron is ordinary function argument passing. Now let's
 take a look how the kernel is implemented.
 
-```hip
+```cu
 __global__ void saxpy_kernel(const float a, const float* d_x, float* d_y, const unsigned int size)
 {
     // Compute the current thread's index in the grid.
@@ -160,7 +160,7 @@ __global__ void saxpy_kernel(const float a, const float* d_x, float* d_y, const
 
 Retrieval of the result from the device is done much like its dispatch:
 
-```hip
+```cu
 HIP_CHECK(hipMemcpy(y.data(), d_y, size_bytes, hipMemcpyDeviceToHost));
 ```
 
@@ -671,6 +671,14 @@ major.minor: 8.6
 major.minor: 7.0
 ```
 
+```{tip}
+Next to the `nvcc` executable is another tool called `__nvcc_device_query`
+which simply prints the SM Architecture numbers to standard out as a comma
+separated list of numbers. The naming of this utility suggests it's not a user
+facing executable but is used by `nvcc` to determine what devices are in the
+system at hand.
+```
+
 :::
 :::{tab-item} Windows & AMD
 :sync: windows-amd
@@ -712,6 +720,14 @@ major.minor: 8.6
 major.minor: 7.0
 ```
 
+```{tip}
+Next to the `nvcc` executable is another tool called `__nvcc_device_query.exe`
+which simply prints the SM Architecture numbers to standard out as a comma
+separated list of numbers. The naming of this utility suggests it's not a user
+facing executable but is used by `nvcc` to determine what devices are in the
+system at hand.
+```
+
 :::
 ::::
 
@@ -719,8 +735,8 @@ Now that we know which versions of graphics IP our devices use, we can recompile
 our program with said parameters.
 
 ::::{tab-set}
-:::{tab-item} Linux
-:sync: linux
+:::{tab-item} Linux & AMD
+:sync: linux-amd
 
 ```bash
 amdclang++ ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -lamdhip64 -L /opt/rocm/lib -O2 --offload-arch=gfx906:sramecc+:xnack-
@@ -735,8 +751,29 @@ First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
 ```
 
 :::
-:::{tab-item} Windows
-:sync: windows
+:::{tab-item} Linux & NVIDIA
+:sync: linux-nvidia
+
+```bash
+nvcc ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -I /opt/rocm/include -O2 -x cu -arch=sm_70,sm_86
+```
+
+```{tip}
+If you want to portably target the development machine which is compiling, you
+may specify `-arch=native` instead.
+```
+
+Now our sample will surely run.
+
+```none
+./saxpy
+Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
+First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
+```
+
+:::
+:::{tab-item} Windows & AMD
+:sync: windows-amd
 
 ```pwsh
 clang++ .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I .\Common -lamdhip64 -L ${env:HIP_PATH}lib -O2 --offload-arch=gfx1032 --offload-arch=gfx1035
@@ -750,5 +787,26 @@ Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
 First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
 ```
 
+:::
+:::{tab-item} Windows & NVIDIA
+:sync: windows-nvidia
+
+```pwsh
+nvcc .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I ${env:HIP_PATH}include -I .\Common -O2 -x cu -arch=sm_70,sm_86
+```
+
+```{tip}
+If you want to portably target the development machine which is compiling, you
+may specify `-arch=native` instead.
+```
+
+Now our sample will surely run.
+
+```none
+.\saxpy.exe
+Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
+First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
+```
+
 :::
 ::::