[WIP][CELEBORN-1319] Optimize skew partition logic for Reduce Mode to avoid sorting shuffle files #2373
base: main
Conversation
Thanks @wangshengjie123 for this PR! I left some comments. In addition, is the small change to Spark missing?
int step = locations.length / subPartitionSize;

// if partition location is [1,2,3,4,5,6,7,8,9,10], and skew partition split to 3 task:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the logic should be like this:
// if partition location is [1,2,3,4,5,6,7,8,9,10], and skew partition split to 3 tasks:
// task 0: 1, 4, 7, 10
// task 1: 2, 5, 8
// task 2: 3, 6, 9
for (int i = 0; i < step + 1; i++) {
  int index = i * step + subPartitionIndex;
  if (index < locations.length) {
    result.add(orderedPartitionLocations[index]);
  }
}
If I am not wrong, the idea is to minimize per-row size - that is why column 0 goes "down" the array index while column 1 goes "up", and keeps alternating - so that as the sizes keep increasing, they are more evenly distributed across rows (essentially a way to approximate the multi-way partition problem).
The result would be different for the formulation above @waitinfuture.
For example:
partition sizes: {1000, 1100, 1300, 1400, 2000, 2500, 3000, 10000, 20000, 25000, 28000, 30000}
subPartitionSize == 3, subPartitionIndex == 1
In the formulation from the PR we have:
task 0: 1000, 2500, 3000, 30000
task 1: 1100, 2000, 10000, 28000
task 2: 1300, 1400, 20000, 25000
So the sizes will be:
task 0: 36500
task 1: 41100
task 2: 47700
As formulated above, we will end up with:
task 0: 1000, 1400, 3000, 25000
task 1: 1100, 2000, 10000, 28000
task 2: 1300, 2500, 20000, 30000
In this case, the sizes will be:
task 0: 30400
task 1: 41100
task 2: 53800
Personally, I would have looked into either a largest-remainder or a knapsack heuristic (given we are sorting anyway).
(Do let me know if I am missing something here @wangshengjie123)
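To make the alternating assignment concrete, here is a small sketch (an editor's illustration, not code from this PR; the names are hypothetical) of the boustrophedon split over size-sorted locations, with plain sizes standing in for PartitionLocation objects:

// Editor's sketch, not the PR's code: zig-zag ("boustrophedon") assignment of
// size-sorted locations to sub-partitions, approximating multi-way number partitioning.
object ZigZagSplit {
  // sortedSizes must be ascending; returns the indices assigned to one sub-partition.
  def subPartitionIndices(sortedSizes: Array[Long], subPartitionSize: Int, subPartitionIndex: Int): Seq[Int] = {
    sortedSizes.indices.filter { i =>
      val round = i / subPartitionSize   // which zig-zag row
      val offset = i % subPartitionSize  // position within the row
      // even rows run left-to-right, odd rows right-to-left
      val task = if (round % 2 == 0) offset else subPartitionSize - 1 - offset
      task == subPartitionIndex
    }
  }
}

With the sizes above and subPartitionSize == 3, task 0 gets indices 0, 5, 6 and 11 (1000 + 2500 + 3000 + 30000 = 36500), matching the PR's formulation.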
@mridulm Sorry for the late reply; your understanding is correct, and I should optimize the logic.
Thanks @mridulm for the explanation, I actually didn't get the idea and was thinking the naive way :)
Interesting work @wangshengjie123 !
Hi @wangshengjie123
Sorry for the late reply, the PR will be updated today or tomorrow.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2373      +/-   ##
==========================================
- Coverage   48.77%   48.51%   -0.26%
==========================================
  Files         209      210       +1
  Lines       13109    13186      +77
  Branches     1134     1139       +5
==========================================
+ Hits         6393     6396       +3
- Misses       6294     6368      +74
  Partials      422      422

☔ View full report in Codecov by Sentry.
Thanks @wangshengjie123, nice PR! Another suggestion: it would be better to add UTs for this feature.
@@ -1393,7 +1414,13 @@ public void onSuccess(ByteBuffer response) {
          Arrays.toString(partitionIds),
          groupedBatchId,
          Arrays.toString(batchIds));

      if (dataPushFailureTrackingEnabled) {
There is no need to do this for HARD_SPLIT, as the worker never writes the batch when HARD_SPLIT occurs. cc @waitinfuture
I'm not sure whether it's possible that the master copy succeeds but the replica copy fails due to HARD_SPLIT. I will check it again.
@@ -615,6 +663,17 @@ private boolean fillBuffer() throws IOException {

    // de-duplicate
    if (attemptId == attempts[mapId]) {
      if (splitSkewPartitionWithoutMapRange) {
- We can reuse one PushFailedBatch object and update its inner fields to be more memory-efficient.
- Better to check whether failedBatches is empty first; we may never need to check failed batches at all.
- Got it.
- Fixed to avoid NPE.
@@ -4671,4 +4671,13 @@ object CelebornConf extends Logging {
    .version("0.5.0")
    .intConf
    .createWithDefault(10000)

  val CLIENT_DATA_PUSH_FAILURE_TRACKING_ENABLED: ConfigEntry[Boolean] =
Maybe we can use another configuration name for enabling the skew-join optimization. CLIENT_DATA_PUSH_FAILURE_TRACKING_ENABLED doesn't feel very straightforward.
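For illustration only, a more self-explanatory entry could follow the same builder pattern as the quoted diff; the key name, description, and default below are hypothetical, not the configuration that was merged:

// Hypothetical sketch: a more descriptive config key for the skew-split feature,
// following the ConfigEntry builder style of the surrounding CelebornConf entries.
val CLIENT_OPTIMIZE_SKEWED_PARTITION_READ_ENABLED: ConfigEntry[Boolean] =
  buildConf("celeborn.client.adaptive.optimizeSkewedPartitionRead.enabled")
    .version("0.5.0")
    .doc("If true, skewed partitions are split by sub-partition index instead of by " +
      "map index range, so shuffle files do not need to be sorted on the worker.")
    .booleanConf
    .createWithDefault(false)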
UTs are in progress; cluster testing is happening this week, and the UTs will be submitted later.
@wangshengjie123 Is there any doc or ticket explaining this approach? Also for the sort-based approach that you mentioned.
From my understanding, in this PR we're diverging from the vanilla Spark approach based on mapIndex and just dividing the full partition into multiple sub-partitions based on some heuristics. I'm new to the Celeborn code, so I might be missing something basic, but in this PR we're not addressing the issue below. Consider a basic scenario where a partial partition read is happening and we see a FetchFailure.
The data generated on location
This can cause both data loss and data duplication; this might be getting addressed somewhere else in the codebase that I'm not aware of, but I wanted to point this problem out.
@s0nskar Good point; this should be an issue for ResultStage, even though the ShuffleMapStage's output is deterministic. IIRC, vanilla Spark also has some limitations on stage retry for ResultStage when the ShuffleMapStage's output is indeterminate; in such cases we need to fail the job, right?
@pan3793 This does not become a problem if we maintain the concept of mapIndex ranges, as Spark will always read deterministic output for each sub-partition. Since vanilla Spark always reads deterministic output because of the mapIndex range filter, it will not face this issue. In this approach, the sub-partitions' data will be indeterminate across stage attempts. Failing would be the only option for such cases until Spark starts supporting ResultStage rollback.
Also, I think this issue would not be limited to ResultStage; it can happen with ShuffleMapStage as well in some complex cases. Consider another scenario –
In this case, though, we can roll back the whole lineage up to this point instead of failing the job, similar to what vanilla Spark does, although this will be very expensive.
@s0nskar I see your point. When consuming skew partitions, we should always treat the previous
Hi @s0nskar, thanks for your point; I think you are correct. It seems this PR conflicts with stage rerun.
@pan3793 Is it possible to force mark it as
Also, I think Spark doesn't correctly set a stage's determinism in some cases, for example a row_number window operator followed by an aggregation keyed by the row_number.
The sort-based approach is roughly like this:
Thanks a lot @waitinfuture for the sort-based approach description.
IMO this would be very difficult to do from Celeborn itself, but it can be done by putting a patch in the Spark code. ShuffledRowRDD can set its determinacy level to INDETERMINATE if partial partition reads are happening and Celeborn is being used. cc: @mridulm for viz
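As a rough sketch of that kind of Spark-side patch (an editor's illustration, not part of this PR, assuming Spark's RDD.getOutputDeterministicLevel hook and the DeterministicLevel enum are accessible to the subclass), a wrapper RDD could force the indeterminate level:

// Editor's sketch, not part of this PR: an RDD that reports INDETERMINATE output, the
// mechanism suggested above for ShuffledRowRDD when Celeborn serves skew partitions
// without map-index ranges.
import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.{DeterministicLevel, RDD}

class IndeterminateOutputRDD[T: ClassTag](parent: RDD[T]) extends RDD[T](parent) {

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    parent.iterator(split, context)

  override protected def getPartitions: Array[Partition] = parent.partitions

  // Force a stage rerun to recompute the whole lineage instead of reusing partial output.
  override protected def getOutputDeterministicLevel: DeterministicLevel.Value =
    DeterministicLevel.INDETERMINATE
}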
@waitinfuture It seems this PR is getting attention and some discussions have happened offline; we'd better update the PR description (or Google Docs) to summarize the whole design and the known issues so far.
}
-  PartialReducerPartitionSpec(reducerId, startMapIndex, endMapIndex, dataSize)
+  if (splitSkewPartitionWithCeleborn) {
+    PartialReducerPartitionSpec(reducerId, mapStartIndices.length, i, dataSize)
We can maybe add a note here that these dataSize values will not be accurate. Even though the current downstream code only takes the sum of dataSize, which should be equal, someone might be using these values differently.
It has been a while since I looked at this PR - but as formulated, the split into subranges is deterministic (if it is not, it should be made so).
The way Celeborn splits partitions is not deterministic across stage reruns (for example, any push failure will cause a split), so I'm afraid this statement does not hold...
Ah, I see what you mean ... I will need to relook at the PR and how it interacts with Celeborn - but if the scenarios directly described in SPARK-23207 (or variants of them) are applicable (and we can't mitigate them), we should not proceed down this path, unfortunately, given the correctness implications.
+CC @otterc as well.
Maybe we can keep both this optimization and stage rerun, but only allow one to take effect by checking configs for now. The performance issue this PR solves does happen in production.
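As a rough illustration of that gating (an editor's sketch; the parameter names are hypothetical and not from this PR), the client could resolve the effective mode up front:

// Editor's sketch with hypothetical names: let only one of the two features take effect.
def resolveSkewSplitWithoutSort(skewSplitWithoutSortEnabled: Boolean, stageRerunEnabled: Boolean): Boolean = {
  if (skewSplitWithoutSortEnabled && stageRerunEnabled) {
    // Conflict: fall back to the map-range (sort-based) path so stage rerun stays correct.
    false
  } else {
    skewSplitWithoutSortEnabled
  }
}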
This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This issue was closed because it has been stale for 10 days with no activity.
Is this optimization used in your production environment?
We are already using this optimization in certain specific skew scenarios. However, due to conflicts with stage rerun, we are not applying it to all jobs. By the way, the community also plans to address these issues in the coming weeks; you can keep an eye on it.
What changes were proposed in this pull request?
Add logic to avoid sorting shuffle files in Reduce mode when optimizing skew partitions.
Why are the changes needed?
The current logic needs to sort shuffle files when reading Reduce-mode skew-partition shuffle files; we observed shuffle-sorting timeouts and performance issues.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Cluster tests and UTs.