
Optimizations and Arch Agnostic Hashinator #20

Merged
merged 187 commits into master on Oct 10, 2023

Conversation

kstppd
Owner

@kstppd commented Aug 30, 2023

This pull request is a massive one and comes with many updates and upgrades:

Formatting and style updates

  • All source files are clang-formatted.
  • Hashinator loses the visual overhead of the predicate structs for overflown and valid elements in favour of lambdas.
  • Documentation has been added to most of the functions, and Doxygen has been added to the repo as well.

Structural updates

  • Both hashinator and splitvector are now arch agnostic, wrapping the CUDA/HIP runtime functions with macros (see the sketch after this list).
  • The test coverage has been updated.
  • The hasher kernels have been split into two versions (NVIDIA / AMD).
  • Virtual Warps do work on AMD, although internal thread communication is only emulated by sub-masking ballot results. This forces full warp syncs, but there is no other way for Virtual Warps to work on AMD.
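
Below is a minimal sketch of the macro-wrapping idea behind the arch-agnostic build. The `split_gpu*` names are purely illustrative, not the repo's actual macros; they simply resolve to either the HIP or the CUDA runtime at compile time.

```cpp
// Illustrative sketch only: a single set of split_gpu* aliases (hypothetical
// names) maps onto the HIP runtime when a HIP compiler is detected and onto
// the CUDA runtime otherwise.
#ifdef __HIP__
   #include <hip/hip_runtime.h>
   #define split_gpuMalloc(ptr, size)         hipMalloc(ptr, size)
   #define split_gpuFree(ptr)                 hipFree(ptr)
   #define split_gpuMemcpy(dst, src, n, kind) hipMemcpy(dst, src, n, kind)
   #define split_gpuMemcpyHostToDevice        hipMemcpyHostToDevice
   #define split_gpuDeviceSynchronize()       hipDeviceSynchronize()
#else
   #include <cuda_runtime.h>
   #define split_gpuMalloc(ptr, size)         cudaMalloc(ptr, size)
   #define split_gpuFree(ptr)                 cudaFree(ptr)
   #define split_gpuMemcpy(dst, src, n, kind) cudaMemcpy(dst, src, n, kind)
   #define split_gpuMemcpyHostToDevice        cudaMemcpyHostToDevice
   #define split_gpuDeviceSynchronize()       cudaDeviceSynchronize()
#endif
```

Library code then calls only the wrapped aliases, so the same source can be built for either NVIDIA or AMD without `#ifdef`s scattered through the containers.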

Performance upgrades

  • The insertion performance of the Hasher kernels has been massively improved, by a factor of at least 3.
  • The atomic updates of fill and overflow are now treated differently. Instead of being updated by every winning thread, they are updated in a two-step process (sketched after this list). First, two warp-wide reductions via registers accumulate the number of added elements and the maximum overflow of each warp. Those results are stored in shared memory (avoiding any bank conflicts) so that they are available to the first warp of each block. A second stage of reductions then lets the first warp know the total elements added per block and the maximum overflow needed (again per block). Only then are those quantities updated. This has massive benefits, especially on AMD HW.
  • The memory efficiency of the insertion kernels has been improved by carefully redesigning them to use 128-bit LDGs whenever possible. This has greatly reduced excessive L1 global accesses and warp stalling, and allowed for higher throughput.
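
Below is a simplified sketch of the two-stage fill/overflow update described above. The names are illustrative and not the actual Hasher kernel code; an NVIDIA warp size of 32 and a block size that is a multiple of 32 are assumed.

```cpp
// Illustrative two-stage counter update: warp-level register reductions first,
// then a per-block reduction by the first warp, and only one atomicAdd /
// atomicMax per block at the very end.
__device__ void updateCounters(int added, int overflow,
                               unsigned int* d_fill, int* d_overflow) {
   constexpr int WARPSIZE = 32;               // NVIDIA; AMD wavefronts differ
   const int lane   = threadIdx.x % WARPSIZE;
   const int warpId = threadIdx.x / WARPSIZE;
   const int nWarps = blockDim.x / WARPSIZE;  // assumes blockDim.x % 32 == 0

   // Stage 1: warp-wide reductions purely in registers.
   for (int offset = WARPSIZE / 2; offset > 0; offset >>= 1) {
      added    += __shfl_down_sync(0xFFFFFFFF, added, offset);
      overflow  = max(overflow, __shfl_down_sync(0xFFFFFFFF, overflow, offset));
   }

   // Per-warp results go to shared memory so the first warp can see them.
   __shared__ int warpAdded[32];
   __shared__ int warpOverflow[32];
   if (lane == 0) {
      warpAdded[warpId]    = added;
      warpOverflow[warpId] = overflow;
   }
   __syncthreads();

   // Stage 2: the first warp reduces the per-warp results and performs the
   // single atomic update for the whole block.
   if (warpId == 0) {
      int a = (lane < nWarps) ? warpAdded[lane] : 0;
      int o = (lane < nWarps) ? warpOverflow[lane] : 0;
      for (int offset = WARPSIZE / 2; offset > 0; offset >>= 1) {
         a += __shfl_down_sync(0xFFFFFFFF, a, offset);
         o  = max(o, __shfl_down_sync(0xFFFFFFFF, o, offset));
      }
      if (lane == 0) {
         atomicAdd(d_fill, static_cast<unsigned int>(a));
         atomicMax(d_overflow, o);
      }
   }
}
```

Touching the global counters once per block instead of once per winning thread is what gives the large win, especially on AMD hardware.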

kstppd and others added 30 commits April 13, 2023 10:19
Less device-blocking operations
…city and _size to Splitvec's optimize methods
kstppd and others added 29 commits September 7, 2023 14:39
`is_pod` is deprecated in C++20.
By replacing `is_pod` with `is_trivially_constructible`, more types can be
used via the optimized copy.
For destruction, `is_nothrow_destructible` is used to destruct types where
possible (see the sketch after the commit list).
- fix missing include
- add support for any HIP compiler by using `__HIP__` instead of
  `__HIP_PLATFORM_HCC__`
- fix hard-coded warp size for AMD GPUs. The warp size depends on the
  architecture.
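
Below is a sketch of the trait-based dispatch described in the `is_pod` commit above. The helper names are hypothetical, not the repo's actual code; it only illustrates how the traits named in the commit gate the optimized copy and the destructor calls.

```cpp
#include <cstddef>
#include <cstring>
#include <new>
#include <type_traits>

// Hypothetical helper: the trait named in the commit (is_trivially_constructible)
// gates the optimized memcpy path that previously required is_pod.
template <typename T>
void copy_elements(T* dst, const T* src, std::size_t n) {
   if constexpr (std::is_trivially_constructible_v<T>) {
      std::memcpy(dst, src, n * sizeof(T));            // optimized bulk copy
   } else {
      for (std::size_t i = 0; i < n; ++i) {
         ::new (static_cast<void*>(dst + i)) T(src[i]); // copy-construct in place
      }
   }
}

// Hypothetical helper: destructors are invoked only for types that can be
// destroyed without throwing, as described in the commit.
template <typename T>
void destroy_elements(T* ptr, std::size_t n) {
   if constexpr (std::is_nothrow_destructible_v<T>) {
      for (std::size_t i = 0; i < n; ++i) {
         ptr[i].~T();
      }
   }
}
```
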
@kstppd merged commit 3fe12b7 into master Oct 10, 2023
1 check passed