-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race detected in Parquet writer #14598
Conversation
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -206,7 +206,8 @@ void __device__ calculate_frag_size(frag_init_state_s* const s, int t) | |||
} | |||
} | |||
__syncthreads(); | |||
auto const total_len = block_reduce(reduce_storage).Sum(len); | |||
auto const total_len = block_reduce(reduce_storage).Sum(len); | |||
__syncthreads(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I hesitated to bring this up since the CI didn't complain. Generally, a sync in-between is required when the same temp storage is used for different reductions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was worried too, but dug into the Sum code and saw that there were syncthreads inside and convinced myself it was ok :(
cpp/src/io/parquet/page_enc.cu
Outdated
__syncthreads(); | ||
auto const total_len = block_reduce(reduce_storage).Sum(len); | ||
auto const total_len = block_reduce(reduce_storage).Sum(len); | ||
__syncthreads(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we really need two sync like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps the first one is unecessary, but now I'm scared to change it 😬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, maybe we should remove it if racecheck does not complain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try removing it and see...
Yep, no racecheck problems (beyond the nvcomp and decode ones) without the first sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is racecheck?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try removing it and see
trying this on my end as well
What is racecheck?
racecheck tool in compute-sanitizer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no racecheck errors, and, looking at the code, no shared data is modified before the reductions. IMHO we should be good to remove the first sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for looking into this issue promptly!
Hopefully that's the last of the ManyFragments
failures.
/ok to test |
/merge |
While investigating rapidsai#14597 it was found that there was a race introduced by rapidsai#14000. Adding a sync between invocations of a block reduce resolves the error. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) URL: rapidsai#14598
Description
While investigating #14597 it was found that there was a race introduced by #14000. Adding a sync between invocations of a block reduce resolves the error.
Checklist