Reduce stream syncs in split tools #40

Closed
wants to merge 2 commits into from

Owner

@kstppd kstppd commented Feb 7, 2024

Reduces the number of stream syncs in the split tools.

Collaborator

@markusbattarbee markusbattarbee left a comment

More optimization notes.

@@ -779,6 +766,7 @@ void copy_keys_if(split::SplitVector<T, split::split_unified_allocator<T>>& inpu
const size_t memory_for_pool = 8 * nBlocks * sizeof(uint32_t);
Cuda_mempool mPool(memory_for_pool, s);
auto len = copy_keys_if_raw(input, output.data(), rule, nBlocks, mPool, s);
SPLIT_CHECK_ERR(split_gpuStreamSynchronize(s));
Collaborator

This one is not needed: copy_keys_if_raw already does a stream sync before returning.
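To illustrate the point, here is a minimal host-only C++ sketch. It is not the library's code: `MockStream` and `compact_even_raw` are hypothetical stand-ins for a GPU stream and for a `copy_keys_if_raw`-style function that must sync before it can return a count to the host. The model only shows that a second sync issued by the caller finds no pending work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical CPU stand-in for a GPU stream: work is queued and
// only executed when synchronize() is called.
struct MockStream {
    std::queue<std::function<void()>> pending;
    int sync_calls = 0;

    void enqueue(std::function<void()> task) { pending.push(std::move(task)); }

    // Runs all queued work; returns how many tasks were drained.
    std::size_t synchronize() {
        ++sync_calls;
        std::size_t drained = 0;
        while (!pending.empty()) {
            pending.front()();
            pending.pop();
            ++drained;
        }
        return drained;
    }
};

// Hypothetical stand-in for a *_raw function: it enqueues the compaction
// and syncs before returning, because the host reads `len` right away.
// `out` must already hold at least in.size() elements.
std::uint32_t compact_even_raw(const std::vector<std::uint32_t>& in,
                               std::vector<std::uint32_t>& out,
                               MockStream& s) {
    std::uint32_t len = 0;
    s.enqueue([&] {
        for (std::uint32_t v : in) {
            if (v % 2 == 0) { out[len++] = v; }
        }
    });
    s.synchronize();  // needed here: the result is consumed on the host
    return len;
}
```

If the caller invokes `s.synchronize()` again right after `compact_even_raw` returns, the call drains zero tasks. On a real device that extra call is still a host-side round trip, which is the cost the review comment is pointing at.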

Owner Author

Addressed in a new PR

@@ -622,14 +622,12 @@ uint32_t copy_if_raw(split::SplitVector<T, split::split_unified_allocator<T>>& i
uint32_t* d_counts;
Collaborator

Further up, I see what I believe are unnecessary syncs on lines 595, 603, and 611.

Owner Author

Same as below

@@ -622,14 +622,12 @@ uint32_t copy_if_raw(split::SplitVector<T, split::split_unified_allocator<T>>& i
uint32_t* d_counts;
uint32_t* d_offsets;
d_counts = (uint32_t*)mPool.allocate(nBlocks * sizeof(uint32_t));
SPLIT_CHECK_ERR(split_gpuStreamSynchronize(s));
SPLIT_CHECK_ERR(split_gpuMemsetAsync(d_counts, 0, nBlocks * sizeof(uint32_t),s));
Collaborator

d_counts here is passed to scan_reduce_raw as output, and in that kernel it is written to directly, not incremented. Thus, the memset appears unnecessary, as long as the kernel actually writes to all elements. The same check should be applied to the other memsets as well.
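The distinction can be sketched with a host-only C++ model. This is an assumption-laden illustration, not the actual `scan_reduce_raw` kernel: both functions below are hypothetical, each loop iteration stands in for a GPU block or thread, and the predicate (counting even values) is arbitrary. A buffer pre-filled with garbage models an uninitialized pool allocation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models a kernel that ASSIGNS each block's count. Every element of
// `counts` is overwritten, so its previous contents do not matter.
void count_per_block_assign(const std::vector<std::uint32_t>& in,
                            std::size_t block_size,
                            std::vector<std::uint32_t>& counts) {
    for (std::size_t b = 0; b < counts.size(); ++b) {
        const std::size_t begin = b * block_size;
        const std::size_t end = std::min(begin + block_size, in.size());
        std::uint32_t c = 0;
        for (std::size_t i = begin; i < end; ++i) {
            if (in[i] % 2 == 0) { ++c; }
        }
        counts[b] = c;  // direct write, no read of the old value
    }
}

// Models a kernel that INCREMENTS the counts (atomicAdd-style).
// The result depends on the old contents, so zeroing is required.
// `counts` must hold at least ceil(in.size() / block_size) elements.
void count_per_block_accumulate(const std::vector<std::uint32_t>& in,
                                std::size_t block_size,
                                std::vector<std::uint32_t>& counts) {
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] % 2 == 0) { counts[i / block_size] += 1; }
    }
}
```

With the assigning variant, a garbage-filled buffer yields the same counts as a zeroed one, so the memset can be dropped. With the accumulating variant it cannot, which is why each memset needs the check described above before it is removed.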

Owner Author

To be addressed later as I am in a bit of a git wreck

Owner Author

kstppd commented Mar 1, 2024

This can be closed, as its changes are included in PR #48.

@kstppd kstppd closed this Mar 1, 2024