
Optimizations and Arch Agnostic Hashinator #20

Merged
merged 187 commits into master on Oct 10, 2023

Conversation

kstppd
Owner

@kstppd commented Aug 30, 2023

This pull request is a massive one and comes with many updates and upgrades:

Formatting and style updates

  • All source files are clang-formatted.
  • Hashinator loses the visual overhead of the predicate structs for overflown and valid elements in favour of lambdas.
  • Documentation has been added to most of the functions, and Doxygen has been added to the repo as well.

Structural updates

  • Both hashinator and splitvector are now arch agnostic, wrapping the CUDA/HIP runtime functions with macros (see the sketch after this list).
  • The test coverage has been updated.
  • The hasher kernels have been split into two versions (NVIDIA / AMD).
  • Virtual Warps do work on AMD, although internal thread communication is only emulated by sub-masking ballot results. This forces full warp syncs, but there is no other way for Virtual Warps to work on AMD.
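
Below is a minimal sketch of the macro-wrapping idea behind the arch-agnostic build. The `split_gpu*` names are purely illustrative, not the repo's actual macros; they simply resolve to either the HIP or the CUDA runtime at compile time.

```cpp
// Illustrative sketch only: a single set of split_gpu* aliases (hypothetical
// names) maps onto the HIP runtime when a HIP compiler is detected and onto
// the CUDA runtime otherwise.
#ifdef __HIP__
   #include <hip/hip_runtime.h>
   #define split_gpuMalloc(ptr, size)         hipMalloc(ptr, size)
   #define split_gpuFree(ptr)                 hipFree(ptr)
   #define split_gpuMemcpy(dst, src, n, kind) hipMemcpy(dst, src, n, kind)
   #define split_gpuMemcpyHostToDevice        hipMemcpyHostToDevice
   #define split_gpuDeviceSynchronize()       hipDeviceSynchronize()
#else
   #include <cuda_runtime.h>
   #define split_gpuMalloc(ptr, size)         cudaMalloc(ptr, size)
   #define split_gpuFree(ptr)                 cudaFree(ptr)
   #define split_gpuMemcpy(dst, src, n, kind) cudaMemcpy(dst, src, n, kind)
   #define split_gpuMemcpyHostToDevice        cudaMemcpyHostToDevice
   #define split_gpuDeviceSynchronize()       cudaDeviceSynchronize()
#endif
```

Library code then calls only the wrapped aliases, so the same source can be built for either NVIDIA or AMD without `#ifdef`s scattered through the containers.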

Performance upgrades

  • The insertion performance of the Hasher kernels has been massively improved, by a factor of at least 3.
  • The atomic updates of fill and overflow are now treated differently. Instead of being updated by every winning thread, they are updated in a two-step process (sketched after this list). First, two warp-wide reductions via registers accumulate the number of added elements and the maximum overflow of each warp. Those results are stored in shared memory (avoiding any bank conflicts) so that they are available to the first warp of each block. A second stage of reductions then lets the first warp know the total elements added per block and the maximum overflow needed (again per block). Only then are those quantities updated. This has massive benefits, especially on AMD HW.
  • The memory efficiency of the insertion kernels has been improved by carefully redesigning them to use 128-bit LDGs whenever possible. This has greatly reduced excessive L1 global accesses and warp stalling, and allowed for higher throughput.
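
Below is a simplified sketch of the two-stage fill/overflow update described above. The names are illustrative and not the actual Hasher kernel code; an NVIDIA warp size of 32 and a block size that is a multiple of 32 are assumed.

```cpp
// Illustrative two-stage counter update: warp-level register reductions first,
// then a per-block reduction by the first warp, and only one atomicAdd /
// atomicMax per block at the very end.
__device__ void updateCounters(int added, int overflow,
                               unsigned int* d_fill, int* d_overflow) {
   constexpr int WARPSIZE = 32;               // NVIDIA; AMD wavefronts differ
   const int lane   = threadIdx.x % WARPSIZE;
   const int warpId = threadIdx.x / WARPSIZE;
   const int nWarps = blockDim.x / WARPSIZE;  // assumes blockDim.x % 32 == 0

   // Stage 1: warp-wide reductions purely in registers.
   for (int offset = WARPSIZE / 2; offset > 0; offset >>= 1) {
      added    += __shfl_down_sync(0xFFFFFFFF, added, offset);
      overflow  = max(overflow, __shfl_down_sync(0xFFFFFFFF, overflow, offset));
   }

   // Per-warp results go to shared memory so the first warp can see them.
   __shared__ int warpAdded[32];
   __shared__ int warpOverflow[32];
   if (lane == 0) {
      warpAdded[warpId]    = added;
      warpOverflow[warpId] = overflow;
   }
   __syncthreads();

   // Stage 2: the first warp reduces the per-warp results and performs the
   // single atomic update for the whole block.
   if (warpId == 0) {
      int a = (lane < nWarps) ? warpAdded[lane] : 0;
      int o = (lane < nWarps) ? warpOverflow[lane] : 0;
      for (int offset = WARPSIZE / 2; offset > 0; offset >>= 1) {
         a += __shfl_down_sync(0xFFFFFFFF, a, offset);
         o  = max(o, __shfl_down_sync(0xFFFFFFFF, o, offset));
      }
      if (lane == 0) {
         atomicAdd(d_fill, static_cast<unsigned int>(a));
         atomicMax(d_overflow, o);
      }
   }
}
```

Touching the global counters once per block instead of once per winning thread is what gives the large win, especially on AMD hardware.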

kstppd and others added 30 commits April 13, 2023 10:19
Less device-blocking operations
…city and _size to Splitvec's optimize methods
kstppd and others added 29 commits September 7, 2023 14:39
`is_pod` is deprecated in C++20.
By replacing `is_pod` with `is_trivially_constructible`, more types can be
used via the optimized copy.
For destruction, `is_nothrow_destructible` is used to destruct types where
possible (see the sketch after the commit list).
- fix missing include
- add support for any HIP compiler by using `__HIP__` instead of
  `__HIP_PLATFORM_HCC__`
- fix hard-coded warp size for AMD GPUs. The warp size depends on the
  architecture.
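
Below is a sketch of the trait-based dispatch described in the `is_pod` commit above. The helper names are hypothetical, not the repo's actual code; it only illustrates how the traits named in the commit gate the optimized copy and the destructor calls.

```cpp
#include <cstddef>
#include <cstring>
#include <new>
#include <type_traits>

// Hypothetical helper: the trait named in the commit (is_trivially_constructible)
// gates the optimized memcpy path that previously required is_pod.
template <typename T>
void copy_elements(T* dst, const T* src, std::size_t n) {
   if constexpr (std::is_trivially_constructible_v<T>) {
      std::memcpy(dst, src, n * sizeof(T));            // optimized bulk copy
   } else {
      for (std::size_t i = 0; i < n; ++i) {
         ::new (static_cast<void*>(dst + i)) T(src[i]); // copy-construct in place
      }
   }
}

// Hypothetical helper: destructors are invoked only for types that can be
// destroyed without throwing, as described in the commit.
template <typename T>
void destroy_elements(T* ptr, std::size_t n) {
   if constexpr (std::is_nothrow_destructible_v<T>) {
      for (std::size_t i = 0; i < n; ++i) {
         ptr[i].~T();
      }
   }
}
```
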
@kstppd merged commit 3fe12b7 into master Oct 10, 2023
1 check passed