Potential optimizations
- Return early from interact kernels before constructing thread views (@pcanal)
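  A minimal sketch of the pattern (kernel signature and names are illustrative, not the actual Celeritas kernels):

  ```c++
  __global__ void interact_kernel(unsigned int num_tracks /* , view data... */)
  {
      unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
      if (tid >= num_tracks)
          return;  // bail out before constructing any thread/track views

      // ... construct MaterialTrackView, PhysicsTrackView, etc. for `tid`
      //     and perform the interaction ...
  }
  ```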
- Pull the RNG state into local memory in the `RNGEngine` constructor, then write it back to global memory in the `RNGEngine` destructor (#94; no effect before the Collection refactor, but substantially reduces load/store in the simple case, see https://github.com/celeritas-project/cuda-test-snippets/commit/84c6f601cb017c55afcee49585c9dca1ad1f511f)
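  A minimal sketch of the copy-in/copy-out pattern, assuming a cuRAND-style state array; the class and member names are illustrative, not the actual Celeritas `RNGEngine` interface:

  ```c++
  #include <curand_kernel.h>

  class LocalRngEngine
  {
    public:
      // Copy the persistent state into a thread-local copy (one global load)
      __device__ LocalRngEngine(curandState_t* global_state, unsigned int tid)
          : global_state_(global_state + tid), local_state_(*global_state_)
      {
      }

      // Write the advanced state back to global memory exactly once
      __device__ ~LocalRngEngine() { *global_state_ = local_state_; }

      // All sampling operates on the local copy only
      __device__ unsigned int operator()() { return curand(&local_state_); }

    private:
      curandState_t* global_state_;
      curandState_t local_state_;
  };
  ```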
- Rearrange memory layout of large temporary data storage to have more struct-of-array accesses for improved coalescing: stride `MaterialTrackView::element_scratch` aligned and strided by number of tracks and padded; same with `PhysicsTrackView::per_process_xs`
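  An illustrative comparison of the two indexing schemes (function names are hypothetical):

  ```c++
  // Array-of-structs layout: each track's scratch values are adjacent, so
  // neighboring threads in a warp read strided addresses (poor coalescing).
  __device__ double aos_get(double const* scratch, unsigned int track,
                            unsigned int elem, unsigned int num_elements)
  {
      return scratch[track * num_elements + elem];
  }

  // Struct-of-arrays layout: stride by the (padded) track count so that a
  // warp reading the same element index touches contiguous memory.
  __device__ double soa_get(double const* scratch, unsigned int track,
                            unsigned int elem, unsigned int num_tracks_padded)
  {
      return scratch[elem * num_tracks_padded + track];
  }
  ```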
- Change particle data to have energy and def_id as separate contiguous arrays (#94, no effect)
- Possibly allow inter-thread cooperation, refactoring track views and such so that they perform no-ops for inactive threads (except when cooperating with other threads) rather than just returning early
- For EM cross sections: instead of splitting the energy range into a regular scaling and a 1/E scaling, store the actual cross section values and just change the interpolation from special-casing 1/E to using log/semilog interpolation.
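  A sketch of the interpolation change (plain host-style code, illustrative names): with log-log interpolation on the stored values, a region where the cross section scales as 1/E is reproduced exactly without a separately scaled table.

  ```c++
  #include <cmath>

  // Interpolate a tabulated cross section between grid points (e_lo, xs_lo)
  // and (e_hi, xs_hi) at energy e using log-log interpolation.
  double interpolate_loglog(double e, double e_lo, double e_hi,
                            double xs_lo, double xs_hi)
  {
      double frac = std::log(e / e_lo) / std::log(e_hi / e_lo);
      return xs_lo * std::exp(frac * std::log(xs_hi / xs_lo));
  }
  ```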
- For models like positron annihilation -> two gammas, photoelectric scattering, and Moller-Bhabha, where a substantial part of the kernel is completely different, break the `if` blocks (if at rest, if electron vs positron, if x-rays are enabled) into separate helper classes that can serve as template arguments for the interactors. Then try instantiating the interactors on the different templates to see how that affects kernel size/register usage. Then try splitting into multiple model IDs.
- Add an iterator facade class for accessing "global" `Collection` data that calls `__ldg` for const-reference access (see the sketch below).
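  A hedged sketch of such a facade (hypothetical name, not an existing Celeritas class). Note that `__ldg` is only defined for arithmetic and CUDA vector types, and it returns by value, so dereferencing yields a copy rather than a true reference:

  ```c++
  template<class T>
  class LdgIterator
  {
    public:
      __device__ explicit LdgIterator(T const* ptr) : ptr_(ptr) {}

      // Route read-only access through the read-only data cache
      __device__ T operator*() const { return __ldg(ptr_); }
      __device__ T operator[](unsigned int i) const { return __ldg(ptr_ + i); }
      __device__ LdgIterator& operator++() { ++ptr_; return *this; }

    private:
      T const* ptr_;
  };
  ```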
- Consider other helper functions for cache optimizations: see https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators .
- Maybe we can use `__builtin_assume(__isGlobal(p))` in the Container to help the optimizer realize the data are all in global memory? (Would require CUDA 11.1+; a sketch is below.)
- Add special cases for the RNG sampler to avoid extra unnecessary instructions/zero-adds
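  A minimal sketch of the hint (the helper function is hypothetical; `__isGlobal` and `__builtin_assume` are CUDA built-ins):

  ```c++
  // Promise the compiler that `p` points into global memory so it can avoid
  // generic-address-space handling when dereferencing.
  template<class T>
  __device__ T const& global_ref(T const* p)
  {
      __builtin_assume(__isGlobal(p));
      return *p;
  }
  ```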
- Store `log(energy)` in particle track and use for lookups in energy grid etc. (define Quantity)
- When we have a more complex demo app, try out the "preallocate one secondary" again, even though that will complicate the code.
- If the preallocated secondary (one per thread) isn't an improvement, try adding `ThreadId` to each secondary, then have the secondary postprocessing kernel be launched in parallel over secondaries rather than primaries.
- Define `ImplicitlySizedCollection` whose "references" only store a pointer, not a span -- for things like states that have many collections of the same size, or `StackAllocator.atomic_size` which has a fixed size of 1. This will reduce the amount of constant memory used (smaller kernel arguments sent at launch), but I have no idea if that could make anything faster in practice. Maybe allow the compiler to optimize better?
- Switch column/row ordering on cutoffs or other data structures to see if memory locality is affected.
- Add extra short-circuiting logic in `use_integral_xs`
- Specialize `celeritas::min` and `::max` to use `std::fmin`/`fmax` for real numbers -- it uses a builtin instruction rather than a conditional (see the sketch below).
- Add a size_type template parameter for `Span` so we can use 32-bit sizes
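  A hedged sketch of the specialization (function name and exact form are illustrative); note that `fmin`/`fmax` also treat NaN differently than an `operator<`-based min/max:

  ```c++
  #include <cmath>

  // Generic fallback: compare-and-select (or a branch)
  template<class T>
  __host__ __device__ inline T min_value(T a, T b)
  {
      return b < a ? b : a;
  }

  // Floating-point overloads map to a single min instruction on the device
  __host__ __device__ inline double min_value(double a, double b)
  {
      return fmin(a, b);
  }

  __host__ __device__ inline float min_value(float a, float b)
  {
      return fminf(a, b);
  }
  ```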
- Try using binary search instead of linear search on micro xs CDFs
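  A sketch of the lower-bound search over a per-material CDF (names are illustrative; `xi` is a uniform sample on `[0, 1)`):

  ```c++
  // Return the first index i such that cdf[i] >= xi. The CDF is nondecreasing
  // and its last entry is 1, so the result is always a valid element index.
  __device__ unsigned int find_element(double const* cdf, unsigned int size, double xi)
  {
      unsigned int lo = 0;
      unsigned int hi = size;
      while (lo < hi)
      {
          unsigned int mid = lo + (hi - lo) / 2;
          if (cdf[mid] < xi)
              lo = mid + 1;
          else
              hi = mid;
      }
      return lo;
  }
  ```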
- Reduce memory usage by dropping micro xs CDF data from the last element
Reviewing the CUDA best practices guide suggests a few other potential optimizations:
- Use signed integers for indexing, rather than unsigned, so the compiler doesn't have to worry about overflow
- Instead of sampling on `[0, 2pi)`, provide a function/functor that does the rotate-from-spherical by sampling on `[0, 1)` and using the special `sinpi` and `cospi` functions. (Note: use `Quantity<Revolution>` for that version to overload.)
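  A hedged sketch of the azimuthal sampling (`sinpi`/`cospi` are CUDA math functions; the sampler interface is illustrative):

  ```c++
  #include <curand_kernel.h>

  // Sample the azimuthal angle as a fraction of a revolution: u is uniform on
  // (0, 1], and sinpi/cospi take the angle in units of pi, so phi = 2*pi*u
  // never has to be formed explicitly.
  __device__ void sample_azimuthal(curandState_t* state, double* cosphi, double* sinphi)
  {
      double u = curand_uniform_double(state);
      *sinphi = sinpi(2 * u);
      *cosphi = cospi(2 * u);
  }
  ```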
- For CUDA >= 11.2 there is a `__builtin_assume` function for hinting to the compiler. This could be added to the `Container` for array access hints, etc., and it could also be used as a replacement for the `ASSERT` macros when debugging is disabled: though many assertions like bounds and nullptr checking are already "assumed" by the compiler. We'd probably want another category of assertions if there was solid evidence for potential speedup. (A sketch of such a macro follows this list.)
- Add alignment support when building `Collection`s
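A hedged sketch of an "assume" macro along those lines (the macro name is hypothetical, not an existing Celeritas macro): it checks the condition when assertions are enabled and becomes an optimizer hint in optimized device builds.

```c++
#include <cassert>

#if defined(NDEBUG) && defined(__CUDA_ARCH__)                         \
    && (__CUDACC_VER_MAJOR__ > 11                                     \
        || (__CUDACC_VER_MAJOR__ == 11 && __CUDACC_VER_MINOR__ >= 2))
// Optimized device build with CUDA >= 11.2: turn the condition into a hint
#    define SKETCH_ASSUME(COND) __builtin_assume(COND)
#elif defined(NDEBUG)
// Optimized host build (or older CUDA): no-op
#    define SKETCH_ASSUME(COND) ((void)0)
#else
// Debug build: behave like an ordinary assertion
#    define SKETCH_ASSUME(COND) assert(COND)
#endif

// Example use: promise that an index is in bounds before unchecked access
__device__ double get(double const* data, unsigned int i, unsigned int size)
{
    SKETCH_ASSUME(i < size);
    return data[i];
}
```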