Fix race detected in Parquet writer #14598

etseidl · 2023-12-08T17:59:58Z

Description

While investigating #14597 it was found that there was a race introduced by #14000. Adding a sync between invocations of a block reduce resolves the error.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2023-12-08T18:00:03Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

davidwendt · 2023-12-08T18:03:25Z

/ok to test

PointKernel

LGTM

PointKernel · 2023-12-08T18:16:48Z

cpp/src/io/parquet/page_enc.cu

@@ -206,7 +206,8 @@ void __device__ calculate_frag_size(frag_init_state_s* const s, int t)
    }
  }
  __syncthreads();
-  auto const total_len   = block_reduce(reduce_storage).Sum(len);
+  auto const total_len = block_reduce(reduce_storage).Sum(len);
+  __syncthreads();


Right, I hesitated to bring this up since the CI didn't complain. Generally, a sync in-between is required when the same temp storage is used for different reductions.

I was worried too, but dug into the Sum code and saw that there were syncthreads inside and convinced myself it was ok :(
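For anyone reading later, here is a minimal sketch of the pattern discussed above. It is not the cuDF code: the kernel `two_sums`, the helper `launch_two_sums`, and the block size of 128 are all made up for illustration. It only shows two `cub::BlockReduce` calls sharing one `TempStorage` with a barrier in between, which the CUB documentation asks for whenever the temporary storage is reused.

```cuda
#include <cub/block/block_reduce.cuh>
#include <cuda_runtime.h>

#include <cstdio>
#include <vector>

constexpr int block_size = 128;  // assumption for this sketch only

// Hypothetical kernel: two block-wide sums that share one TempStorage.
__global__ void two_sums(int const* a, int const* b, int* out)
{
  using block_reduce = cub::BlockReduce<int, block_size>;
  __shared__ typename block_reduce::TempStorage reduce_storage;

  int const t = threadIdx.x;

  // First reduction. The returned value is only valid in thread 0.
  int const sum_a = block_reduce(reduce_storage).Sum(a[t]);

  // Barrier before reusing reduce_storage. Without it, a warp that finishes
  // early may start writing partial results for the second reduction while
  // thread 0 is still reading the partial results of the first.
  __syncthreads();

  // Second reduction, reusing the same shared storage.
  int const sum_b = block_reduce(reduce_storage).Sum(b[t]);

  if (t == 0) {
    out[0] = sum_a;
    out[1] = sum_b;
  }
}

// Host helper: fills a with 1s and b with 2s, runs one block, copies both sums back.
bool launch_two_sums(int* sum_a, int* sum_b)
{
  std::vector<int> h_a(block_size, 1), h_b(block_size, 2);
  int h_out[2] = {0, 0};
  int *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
  size_t const bytes = block_size * sizeof(int);

  if (cudaMalloc(&d_a, bytes) != cudaSuccess) return false;
  if (cudaMalloc(&d_b, bytes) != cudaSuccess) return false;
  if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;

  cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

  two_sums<<<1, block_size>>>(d_a, d_b, d_out);
  bool const ok = cudaDeviceSynchronize() == cudaSuccess;

  cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_out);

  *sum_a = h_out[0];
  *sum_b = h_out[1];
  return ok;
}

int main()
{
  int sum_a = 0, sum_b = 0;
  if (!launch_two_sums(&sum_a, &sum_b)) return 1;
  std::printf("sum_a=%d sum_b=%d\n", sum_a, sum_b);
  // 128 ones and 128 twos are expected to sum to 128 and 256.
  return (sum_a == block_size && sum_b == 2 * block_size) ? 0 : 1;
}
```

A sketch like this can be built with nvcc and run under `compute-sanitizer --tool racecheck`. Deleting the `__syncthreads()` between the two reductions is the kind of hazard that tool is meant to flag, though a missing barrier may still produce correct sums on most runs, which would be consistent with CI not complaining.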

ttnghia · 2023-12-08T18:30:39Z

cpp/src/io/parquet/page_enc.cu

  __syncthreads();
-  auto const total_len   = block_reduce(reduce_storage).Sum(len);
+  auto const total_len = block_reduce(reduce_storage).Sum(len);
+  __syncthreads();


So do we really need two syncs like this?

Perhaps the first one is unnecessary, but now I'm scared to change it 😬

hm, maybe we should remove it if racecheck does not complain

I'll try removing it and see...

Yep, no racecheck problems (beyond the nvcomp and decode ones) without the first sync.

What is racecheck?

> I'll try removing it and see

trying this on my end as well

> What is racecheck?

racecheck tool in compute-sanitizer

no racecheck errors, and, looking at the code, no shared data is modified before the reductions. IMHO we should be good to remove the first sync.

vuule

Thank you for looking into this issue promptly!
Hopefully that's the last of the ManyFragments failures.

PointKernel · 2023-12-08T19:39:37Z

/ok to test

vuule · 2023-12-09T00:05:07Z

/merge

While investigating rapidsai#14597 it was found that there was a race introduced by rapidsai#14000. Adding a sync between invocations of a block reduce resolves the error.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#14598

etseidl and others added 2 commits December 8, 2023 09:56

add sync between block_reduce calls

c5b16ec

Merge branch 'rapidsai:branch-24.02' into block_reduce_sync

6f88e72

etseidl requested a review from a team as a code owner December 8, 2023 17:59

etseidl requested review from vyasr and nvdbaranec December 8, 2023 18:00

github-actions bot added the libcudf label Dec 8, 2023

davidwendt added the bug, 3 - Ready for Review, cuIO, and non-breaking labels Dec 8, 2023

PointKernel approved these changes Dec 8, 2023

ttnghia reviewed Dec 8, 2023

vuule approved these changes Dec 8, 2023

ttnghia approved these changes Dec 8, 2023

etseidl added 2 commits December 8, 2023 10:51

remove unnecessary syncthreads

41b2ea5

Merge branch 'block_reduce_sync' of github.com:etseidl/cudf into block_reduce_sync

c7701c4

shwina mentioned this pull request Dec 8, 2023

Remove legacy benchmarks for cuDF-python #14591

Merged

3 tasks

rapids-bot bot merged commit 899e392 into rapidsai:branch-24.02 Dec 9, 2023
67 checks passed

etseidl deleted the block_reduce_sync branch December 9, 2023 01:02

etseidl mentioned this pull request Dec 16, 2023

[BUG] Random Parquet CI failures #14597

Closed
