
Added a variable sized SoA macro definition. #4

Open · wants to merge 5 commits into base: clue_porting_to_cmssw

Conversation

ericcano

Applied it to define HGCCLUESoA.

Possible modifications for getting automatic definition of variable sized SoAs. The SoAs are defined with macros from CUDADataFormats/Common/interface/SoAmacros.h, so that package should be added to the test.

@ericcano
Author

@b-fontana , is this the kind of SoA definitions you were looking for?
@fwyzard, could you have a look too?
I automated the alignment (ideally it should be cache aligned, which is 128 bytes on the GPU). For the CPU version, we are 128-byte aligned within the data block, but the block allocation itself might not be that well aligned. This is probably not an issue on the CPU side.

@ericcano
Author

Of course the same can be applied to (uncalibrated) rec hits as well.

@fwyzard

fwyzard commented Jun 28, 2021 via email

Comment on lines +15 to +16
/* CUDA allocations are already aligned */
mMemCLUEDev = cms::cuda::make_device_unique<std::byte[]>(
Owner

I think we mean different things here (I have to admit my comment in the code was misleading). I was not trying to align the overall memory block. I was instead making sure the total size of the allocated memory is a multiple of the warpSize (32), and further making sure that each variable within the SoA is also aligned, so that no warp has to access unnecessary memory blocks (which is why I later use pad_ when defining the layout of the SoA). If I did not do this, the only variable of the SoA guaranteed to be aligned would be the first.

Author

@ericcano left a comment

Indeed, the macros take care of aligning the arrays for efficient access in the GPU. The purpose of the macros is to avoid the repetition of alignment code in all SoA structures.

The code assumes that the data block itself is properly aligned (which is the case with the CUDA allocator, but not guaranteed on the CPU side).
The array alignment can be defined at runtime, but I found only compile-time solutions for allocation alignment on the CPU side.

The default 128-byte alignment should be correct in most situations (see the link in the code comment).

Comment on lines 95 to 114
#define _ASSIGN_SOA_COLUMN_OR_SCALAR_IMPL(IS_COLUMN, TYPE, NAME) \
BOOST_PP_CAT(NAME, _) = reinterpret_cast<TYPE *>(curMem); \
BOOST_PP_IIF(IS_COLUMN, \
curMem += ((nElements_ * sizeof(TYPE) / byteAlignment_) + 1) * byteAlignment_; \
, \
curMem += ((sizeof(TYPE) / byteAlignment_) + 1) * byteAlignment_; \
)

#define _ASSIGN_SOA_COLUMN_OR_SCALAR(R, DATA, TYPE_NAME) \
_ASSIGN_SOA_COLUMN_OR_SCALAR_IMPL TYPE_NAME

#define _ACCUMULATE_SOA_ELEMENT_IMPL(IS_COLUMN, TYPE, NAME) \
BOOST_PP_IIF(IS_COLUMN, \
ret += ((nElements * sizeof(TYPE) / byteAlignment) + 1) * byteAlignment; \
, \
ret += ((sizeof(TYPE) / byteAlignment) + 1) * byteAlignment; \
)

#define _ACCUMULATE_SOA_ELEMENT(R, DATA, TYPE_NAME) \
_ACCUMULATE_SOA_ELEMENT_IMPL TYPE_NAME
Author

This code takes care of padding the blocks bytewise. Looking at it again, a -1 is missing in the rounding computation; I will fix that.

Comment on lines +206 to +210
/* For CUDA applications, we align to the 128 bytes of the cache lines. \
 * See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0; \
 * this is still valid up to compute capability 8.x. \
*/ \
constexpr static size_t defaultAlignment = 128; \
Author

The default alignment is defined here, and should be a good default.

ericcano added 2 commits June 29, 2021 10:45
Re-aligned stray line continuations.
Range checking is enabled by defining the DEBUG macro.
@fwyzard

fwyzard commented Jun 30, 2021

The code assumes that the data block itself is properly aligned (which is the case with the CUDA allocator, but not guaranteed on the CPU side).
The array alignment can be defined at runtime, but I found only compile-time solutions for allocation alignment on the CPU side.

We can use posix_memalign / free.

3 participants