Potential optimizations

  • Return early from interact kernels before constructing thread views (@pcanal)
  • Pull RNG state into local memory in the RNGEngine constructor, then write it back to global memory in the RNGEngine destructor (#94; no effect before the Collection refactor, but substantially reduces loads/stores in the simple case; see https://github.com/celeritas-project/cuda-test-snippets/commit/84c6f601cb017c55afcee49585c9dca1ad1f511f and the RNG sketch after this list)
  • Rearrange the memory layout of large temporary data storage to use more struct-of-arrays accesses for improved coalescing: make MaterialTrackView::element_scratch strided by the number of tracks (aligned and padded), and do the same for PhysicsTrackView::per_process_xs (see the strided-scratch sketch after this list)
  • Change particle data to store energy and def_id as separate contiguous arrays (#94, no effect)
  • Possibly allow inter-thread cooperation, refactoring track views and such so that they perform no-ops for inactive threads (except when cooperating with other threads) rather than just returning early
  • For EM cross sections: instead of splitting the energy range into a regular scaling and a 1/E scaling, store the actual cross section values and just change the interpolation from special-casing 1/E to using log/semilog interpolation.
  • For models like positron annihilation into two gammas, photoelectric scattering, and Moller-Bhabha — where a substantial part of the kernel is completely different — break the if blocks (if at rest, if electron vs positron, if x-rays are enabled) into separate helper classes that can serve as template arguments for the interactors. Then try instantiating the interactors with the different template arguments to see how that affects kernel size/register usage. Then try splitting into multiple model IDs.
  • Add an iterator facade class for accessing "global" Collection data that calls __ldg for const access (see the __ldg iterator sketch after this list).
  • Consider other helper functions for cache optimizations: see https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators .
  • Maybe we can use __builtin_assume(__isGlobal(p)) in the Container to help the optimizer realize the data are all in global memory? (Would require CUDA 11.1+)
  • Add special cases to the RNG sampler to avoid unnecessary instructions/zero-adds
  • Store log(energy) in the particle track and use it for lookups in the energy grid etc. (define a Quantity)
  • When we have a more complex demo app, try the "preallocate one secondary" approach again, even though that will complicate the code.
  • If the preallocated secondary (one per thread) isn't an improvement, try adding ThreadId to each secondary, then have the secondary postprocessing kernel be launched in parallel over secondaries rather than primaries.
  • Define ImplicitlySizedCollection whose "references" only store a pointer, not a span -- for things like states that have many collections of the same size, or StackAllocator.atomic_size, which has a fixed size of 1. This would reduce the amount of constant memory used (smaller kernel arguments sent at launch), but it's unclear whether that would make anything faster in practice; it might allow the compiler to optimize better.
  • Switch column/row ordering on cutoffs or other data structures to see if memory locality is affected.
  • Add extra short-circuiting logic in use_integral_xs
  • Specialize celeritas::min and ::max to use std::fmin/fmax for real numbers -- these map to a single builtin instruction rather than a conditional (see the fmin sketch after this list).
  • Add a size_type template parameter for Span so we can use 32-bit indices
  • Try using binary search instead of linear search on micro xs CDFs (see the binary-search sketch after this list)
  • Reduce memory usage by dropping micro xs CDF data from the last element
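
The RNG item above can be sketched as follows: a minimal CUDA illustration of loading the persistent state once in a view's constructor and writing it back once in the destructor. RngStateRef, LocalRngEngine, and the curand-based state layout are assumptions for illustration, not the actual Celeritas interfaces.

```c++
// Hypothetical sketch: pull the persistent RNG state into registers on
// construction and write it back once on destruction, replacing repeated
// global-memory loads/stores inside the sampling loop.
#include <curand_kernel.h>

struct RngStateRef
{
    curandState_t* state;  // one persistent state per track, in global memory
};

class LocalRngEngine
{
  public:
    __device__ LocalRngEngine(RngStateRef const& ref, unsigned int track_id)
        : global_(ref.state + track_id), local_(*global_)  // single load
    {
    }

    __device__ ~LocalRngEngine() { *global_ = local_; }  // single store

    __device__ unsigned int operator()() { return curand(&local_); }

  private:
    curandState_t* global_;
    curandState_t local_;  // lives in registers/local memory during the kernel
};
```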
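
For the strided scratch layout, a minimal sketch of track-major striding (the function and parameter names are illustrative, not existing code): data for a fixed element index is contiguous across tracks, so adjacent threads touching the same element index produce coalesced accesses.

```c++
// Illustrative sketch: index per-track, per-element scratch so that the track
// index is the fastest-moving dimension. Adjacent threads (adjacent track IDs)
// then read/write adjacent addresses for the same element index.
__device__ double& element_scratch(double* scratch,
                                   unsigned int track_id,
                                   unsigned int element_idx,
                                   unsigned int num_tracks)
{
    // Layout: [element 0: all tracks][element 1: all tracks]...
    // num_tracks can be padded/aligned to a multiple of the warp size.
    return scratch[element_idx * num_tracks + track_id];
}
```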
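
A sketch of the __ldg iterator facade idea: LdgIterator is a hypothetical name, and because __ldg produces a copy, the sketch returns values rather than const references. It assumes T is one of the types __ldg supports (e.g. int, float, double) and a device of compute capability 3.5 or higher.

```c++
// Hypothetical read-only iterator that routes loads through __ldg so that
// const Collection data is fetched via the read-only (texture) cache path.
template<class T>
class LdgIterator
{
  public:
    __device__ explicit LdgIterator(T const* ptr) : ptr_(ptr) {}

    __device__ T operator*() const { return __ldg(ptr_); }
    __device__ T operator[](int i) const { return __ldg(ptr_ + i); }
    __device__ LdgIterator& operator++()
    {
        ++ptr_;
        return *this;
    }

  private:
    T const* ptr_;
};
```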
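
For the min/max item, a device-side sketch of what the specialization might look like; the generic celeritas::min signature shown here is assumed for illustration.

```c++
// Sketch: the generic version branches, while the double specialization maps
// to the GPU's fmin instruction (no compare-and-select needed).
namespace celeritas
{
template<class T>
__device__ inline T min(T a, T b)
{
    return b < a ? b : a;
}

template<>
__device__ inline double min<double>(double a, double b)
{
    return ::fmin(a, b);
}
}  // namespace celeritas
```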
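
And for the micro xs CDF item, a sketch of replacing the linear scan with an upper-bound-style binary search; the function and argument names are illustrative, assuming cdf is monotonically nondecreasing and u is uniform on [0, 1).

```c++
// Illustrative sketch: find the first CDF entry greater than the uniform
// sample u, i.e. the element index selected by inverse transform sampling.
__device__ unsigned int sample_element_index(double const* cdf,
                                             unsigned int size,
                                             double u)
{
    unsigned int lo = 0;
    unsigned int hi = size;
    while (lo < hi)
    {
        unsigned int mid = lo + (hi - lo) / 2;
        if (cdf[mid] <= u)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;  // first index with cdf[index] > u
}
```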

Reviewing the CUDA best practices guide suggests a few other potential optimizations:

  • Use signed integers for indexing, rather than unsigned, so the compiler doesn't have to preserve wraparound semantics on overflow (which inhibits some loop optimizations)
  • Instead of sampling on [0, 2pi), provide a function/functor that does the rotate-from-spherical by sampling on [0, 1) and using the special sinpi and cospi functions (see the sinpi/cospi sketch after this list). (Note: use Quantity<Revolution> for that version to overload.)
  • For CUDA >= 11.2 there is a __builtin_assume function for hinting to the compiler. This could be added to the Container for array access hints, etc., and it could also be used as a replacement for the ASSERT macros when debugging is disabled (see the assume-macro sketch after this list), though many assertions like bounds and null-pointer checks are already "assumed" by the compiler. We'd probably want another category of assertions if there were solid evidence of a potential speedup.
  • Add alignment support when building Collections
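
A sketch of the sinpi/cospi idea: sample the azimuthal angle as a fraction of a revolution and let CUDA's sincospi do the scaling by pi. The function name and signature are illustrative.

```c++
// Illustrative sketch: u is uniform on [0, 1), interpreted as a fraction of a
// revolution. sincospi(2 * u, ...) computes sin(2*pi*u) and cos(2*pi*u)
// without an explicit multiplication by 2*pi before the trig call.
__device__ void unit_circle_from_revolution(double u, double* sinphi, double* cosphi)
{
    sincospi(2.0 * u, sinphi, cosphi);
}
```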
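
And a hedged sketch of how __builtin_assume (combined with __isGlobal from the earlier list) could back a release-mode "assume" macro; CELER_ASSUME is a hypothetical name, not an existing Celeritas macro.

```c++
// Hypothetical macro: checks the condition in debug builds, becomes a pure
// compiler hint in optimized device builds.
#include <cassert>

#if defined(__CUDA_ARCH__) && defined(NDEBUG)
#    define CELER_ASSUME(COND) __builtin_assume(COND)
#else
#    define CELER_ASSUME(COND) assert(COND)
#endif

__device__ double get_value(double const* data, unsigned int i, unsigned int size)
{
    CELER_ASSUME(i < size);          // bounds hint for the array access
    CELER_ASSUME(__isGlobal(data));  // the data are known to live in global memory
    return data[i];
}
```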