
Added a variable sized SoA macro definition. #4

Open · wants to merge 5 commits into base: clue_porting_to_cmssw

Conversation

ericcano

Applied it to define HGCCLUESoA.

Possible modifications for getting automatic definition of variable sized SoAs. The SoAs are defined with macros from CUDADataFormats/Common/interface/SoAmacros.h, so that package should be added to the test.

@ericcano
Author

@b-fontana , is this the kind of SoA definitions you were looking for?
@fwyzard, could you have a look too?
I automated the alignment (ideally it should be cache aligned, which is 128 bytes on the GPU). For the CPU version, we are 128-byte aligned within the data block, but the block allocation itself might not be that well aligned. This is probably not an issue on the CPU side.

@ericcano
Author

Of course the same can be applied to (uncalibrated) rec hits as well.

@fwyzard

fwyzard commented Jun 28, 2021 via email

Comment on lines +15 to +16
/* CUDA allocations are already aligned */
mMemCLUEDev = cms::cuda::make_device_unique<std::byte[]>(
Owner

I think we mean different things here (I have to admit my comment in the code was misleading). I was not trying to align the overall memory block. I was instead making sure the total size of the allocated memory is a multiple of the warpSize (32), and further making sure that each variable within the SoA is also aligned, so that no warp has to access unnecessary memory blocks (which is why I later use pad_ when defining the layout of the SoA). If I did not do this, the only variable of the SoA guaranteed to be aligned would be the first.

Author

@ericcano left a comment

Indeed, the macros take care of aligning the arrays for efficient access in the GPU. The purpose of the macros is to avoid the repetition of alignment code in all SoA structures.

The code assumes that the data block itself is properly aligned (which is the case with the CUDA allocator, but not guaranteed on the CPU side).
The array alignment can be defined at runtime, but I found only compile-time solutions for allocation alignment on the CPU side.

The default 128-byte alignment should be correct in most situations (see the link in the code comment).

Comment on lines 95 to 114
#define _ASSIGN_SOA_COLUMN_OR_SCALAR_IMPL(IS_COLUMN, TYPE, NAME) \
BOOST_PP_CAT(NAME, _) = reinterpret_cast<TYPE *>(curMem); \
BOOST_PP_IIF(IS_COLUMN, \
curMem += ((nElements_ * sizeof(TYPE) / byteAlignment_) + 1) * byteAlignment_; \
, \
curMem += ((sizeof(TYPE) / byteAlignment_) + 1) * byteAlignment_; \
)

#define _ASSIGN_SOA_COLUMN_OR_SCALAR(R, DATA, TYPE_NAME) \
_ASSIGN_SOA_COLUMN_OR_SCALAR_IMPL TYPE_NAME

#define _ACCUMULATE_SOA_ELEMENT_IMPL(IS_COLUMN, TYPE, NAME) \
BOOST_PP_IIF(IS_COLUMN, \
ret += ((nElements * sizeof(TYPE) / byteAlignment) + 1) * byteAlignment; \
, \
ret += ((sizeof(TYPE) / byteAlignment) + 1) * byteAlignment; \
)

#define _ACCUMULATE_SOA_ELEMENT(R, DATA, TYPE_NAME) \
_ACCUMULATE_SOA_ELEMENT_IMPL TYPE_NAME
Author

This code takes care of padding the blocks bytewise. Looking at it again, a -1 is missing in the rounding computation; I will fix that.

Comment on lines +206 to +210
/* For CUDA applications, we align to the 128 bytes of the cache lines. \
 * See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0; \
 * this is still valid up to compute capability 8.x. \
*/ \
constexpr static size_t defaultAlignment = 128; \
Author

The default alignment is defined here, and should be a good default.

ericcano added 2 commits June 29, 2021 10:45
Re-aligned stray line continuations.
Range checking is enabled by defining the DEBUG macro.
@fwyzard

fwyzard commented Jun 30, 2021

The code assumes that the data block itself is properly aligned (which is the case with the CUDA allocator, but not guaranteed on the CPU side).
The array alignment can be defined at runtime, but I found only compile-time solutions for allocation alignment on the CPU side.

We can use posix_memalign / free.

3 participants