repartition-based fallback for hash aggregate v3 #11712

Draft
wants to merge 50 commits into base: branch-24.12
Conversation

@binmahone (Collaborator) commented Nov 8, 2024

This PR replaces #11116, since it has diverged too much from #11116.

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
binmahone and others added 20 commits August 20, 2024 16:35
…gether small buckets

@binmahone binmahone marked this pull request as draft November 8, 2024 07:21
@binmahone binmahone changed the title from "240821 repartition agg v3" to "repartition-based fallback for hash aggregate v3" Nov 8, 2024
@abellina abellina requested review from abellina and revans2 and removed request for revans2 November 8, 2024 14:33
@revans2 (Collaborator) left a comment


I have not finished yet. Could you post an explanation of the changes? I see places in the code that appear to have duplicate functionality, not to mention that the old sort-based agg code completely duplicates a lot of the newer hash repartition-based code.

I really just want to understand what the workflow is supposed to be.


override def serializeStream(out: OutputStream): SerializationStream = new SerializationStream {
  private[this] val dOut: DataOutputStream =
    new DataOutputStream(new BufferedOutputStream(out))

  override def writeValue[T: ClassTag](value: T): SerializationStream = {
    val start = System.nanoTime()

This is going to include the I/O time for writing to out, which can include compression/encryption in addition to I/O depending on the shuffle manager used. Some that do the processing in a background thread will not show this, but the default implementations will show it.

  }
}
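To make the concern above concrete: one way to separate serialization cost from downstream I/O would be to time at the raw stream boundary instead of around writeValue. A minimal sketch, not plugin code — the ioTimeNs accumulator and the wrapping point are assumptions:

import java.io.OutputStream
import java.util.concurrent.atomic.AtomicLong

// Hypothetical accumulator; in the plugin this would presumably be a metric.
val ioTimeNs = new AtomicLong(0L)

// Wraps the stream handed over by the shuffle manager, so time spent inside
// write() (compression, encryption, disk) is attributed to I/O. The timer
// around writeValue then approximates pure serialization cost by subtraction.
class TimedOutputStream(delegate: OutputStream) extends OutputStream {
  override def write(b: Int): Unit = {
    val t0 = System.nanoTime()
    delegate.write(b)
    ioTimeNs.addAndGet(System.nanoTime() - t0)
  }
  override def write(b: Array[Byte], off: Int, len: Int): Unit = {
    val t0 = System.nanoTime()
    delegate.write(b, off, len)
    ioTimeNs.addAndGet(System.nanoTime() - t0)
  }
  override def flush(): Unit = delegate.flush()
  override def close(): Unit = delegate.close()
}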

def voluntaryRelease(context: TaskContext): Unit = {

Can we please have some docs here to explain what is going on? This feels like a totally different feature from hash re-partitioning, and I would like to measure performance changes for it separate from anything with hash aggregation changes.
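A rough sketch of the kind of behavior and documentation being asked for here, assuming the method's job is to give up the GPU semaphore between output batches via GpuSemaphore.releaseIfNecessary as used elsewhere in the plugin; the releaseForbidden guard flag is hypothetical:

/**
 * Voluntarily releases the GPU semaphore held by this task so other tasks can
 * make progress while this one does CPU-side work. Skipped when release has
 * been forbidden, e.g. while GpuAggregateExec still has output batches to offer.
 */
def voluntaryRelease(context: TaskContext): Unit = {
  if (!releaseForbidden.get()) { // hypothetical guard set by operators
    GpuSemaphore.releaseIfNecessary(context)
  }
}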

@@ -551,6 +551,14 @@ val GPU_COREDUMP_PIPE_PATTERN = conf("spark.rapids.gpu.coreDump.pipePattern")
  .integerConf
  .createWithDefault(2)

val ENABLE_VOLUNTARY_GPU_RELEASE_CHECK = conf("spark.rapids.gpu.voluntaryReleaseCheck")
  .doc("If true, the plugin will check if voluntary release of GPU is forbidden, " +
    "e.g. when GpuAggregateExec still have more output batches to offer." +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing a space at the end

  .checkValues(Set("sort", "repartition"))
  .createWithDefault("repartition")

// todo: remove this

Do we have a follow-on issue for that?

@@ -1558,6 +1566,25 @@ val GPU_COREDUMP_PIPE_PATTERN = conf("spark.rapids.gpu.coreDump.pipePattern")
  .checkValue(v => v >= 0 && v <= 1, "The ratio value must be in [0, 1].")
  .createWithDefault(1.0)

val AGG_OUTPUT_SIZE_RATIO = conf("spark.rapids.sql.agg.outputSizeRatioToBatchSize")
  .doc("The ratio of the output size of an aggregation to the batch size. ")

How is this used and how is it different from spark.rapids.sql.agg.skipAggPassReductionRatio? It is not explained well at all here.
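From the ConcatIterator call further down in this diff, the ratio appears to scale the target batch size used when concatenating aggregated output. A sketch of the apparent effect — the variable names are taken from that snippet, the stub values here are purely illustrative:

// Names taken from the ConcatIterator snippet later in this PR.
val aggOutputSizeRatio: Double = 1.0            // spark.rapids.sql.agg.outputSizeRatioToBatchSize
val configuredTargetBatchSize: Long = 1L << 30  // normal target batch size, for illustration

// The concatenation target is the configured batch size scaled by the ratio,
// e.g. ratio 1.0 keeps the normal target; 0.5 makes the agg emit ~half-size batches.
val concatTargetSizeBytes: Long =
  (aggOutputSizeRatio * configuredTargetBatchSize).toLong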

@@ -117,8 +117,8 @@ class SpillableColumnarBatchImpl (
}

override def getColumnarBatch(): ColumnarBatch = {
  GpuSemaphore.acquireIfNecessary(TaskContext.get())

Why? Is this because we can have locked the rapids buffer while blocked waiting to get on the semaphore? Did you see this show up in practice?

This feels like a bug fix and not a part of repartition-based fallback. I would much rather see this as a separate issue/PR.

import org.apache.spark.sql.types._
import org.apache.spark.sql.vectorized.ColumnarBatch

object SBAggregateUtils {

nit: can we not use abbreviations? If this is Sort Based, can we just call it SortFallbackAggregateUtils or something similar?

@@ -335,7 +513,10 @@ class AggHelper(
  // We need to merge the aggregated batches into 1 before calling post process,
  // if the aggregate code had to split on a retry
  if (aggregatedSeq.size > 1) {
    val concatted = concatenateBatches(metrics, aggregatedSeq)
    val concatted =
      withResource(aggregatedSeq) { _ =>

I am confused by this. Was this a bug? This change feels wrong to me.

concatenateBatches has the contract that it will either close everything in toConcat, or, if there is a single item in the sequence, just return it without closing anything. By putting it within a withResource it looks like we are going to double close the data in aggregatedSeq.


SpillableColumnarBatch has the nasty habit (?) of hiding double closes from us (https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/sql-plugin/src/main/scala/com/nvidia/spark/rapids/SpillableColumnarBatch.scala#L137). I'd like to remove this behavior with my spillable changes.
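A toy model of the contract described above (stand-in types, not plugin code) showing how the withResource wrapper produces the double close:

// Models the ownership contract: concatenateBatches consumes (closes) all
// inputs when it merges, or returns the single input unclosed.
class Batch(val id: Int) extends AutoCloseable {
  private var closes = 0
  override def close(): Unit = {
    closes += 1
    assert(closes == 1, s"double close of batch $id")
  }
}

def withResource[T](rs: Seq[Batch])(block: Seq[Batch] => T): T =
  try block(rs) finally rs.foreach(_.close())

def concatenateBatches(toConcat: Seq[Batch]): Batch =
  if (toConcat.size == 1) toConcat.head
  else { toConcat.foreach(_.close()); new Batch(-1) }

// The pattern from this diff: the wrapper closes the inputs again after
// concatenateBatches already consumed them, tripping the assertion.
val concatted = withResource(Seq(new Batch(1), new Batch(2))) { seq =>
  concatenateBatches(seq)
}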

@abellina (Collaborator) commented Nov 8, 2024

I will review this today

realIter = Some(ConcatIterator.apply(firstPassIter,
  (aggOutputSizeRatio * configuredTargetBatchSize).toLong
))
firstPassAggToggle.set(false)

I don't think that this does what you think it does. The line that reads this is used to create an iterator; it is not inside an iterator deciding per batch whether we should or should not do the agg. I added in some print statements and verified that it does indeed agg for every batch, even if the first batch set this to false. Which is a good thing, because if you disabled the initial aggregation on something where the output types do not match the input types you would get a crash or data corruption.
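In other words, the toggle is read once when the iterator is built, not once per batch. A minimal sketch of the two shapes, with hypothetical names and Int standing in for a batch:

import java.util.concurrent.atomic.AtomicBoolean

def agg(batch: Int): Int = batch // stand-in for the first-pass aggregation

val firstPassAggToggle = new AtomicBoolean(true)

// Shape in this PR: the toggle is consulted exactly once, at construction
// time, so setting it to false after the first batch changes nothing.
def builtOnce(input: Iterator[Int]): Iterator[Int] =
  if (firstPassAggToggle.get()) input.map(agg) else input

// Per-batch shape: the decision is re-evaluated inside the iterator, so a
// mid-stream toggle actually takes effect (subject to the type-mismatch
// caveat noted above).
def perBatch(input: Iterator[Int]): Iterator[Int] =
  input.map(b => if (firstPassAggToggle.get()) agg(b) else b)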


// Handle the case of skipping the second and third passes of aggregation.
// This only works when spark.rapids.sql.agg.skipAggPassReductionRatio < 1.
if (!firstBatchChecked && firstPassIter.hasNext

If we are doing an aggregate every time, would it be better to check each batch and skip repartitioning if the batch stayed large?
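A sketch of that suggestion, assuming a per-batch reduction check against something like spark.rapids.sql.agg.skipAggPassReductionRatio; all names here are hypothetical:

case class AggBatch(inputRows: Long, outputRows: Long)

def repartitionAndMerge(b: AggBatch): AggBatch = b // stand-in for passes 2 and 3

// Route each first-pass batch individually: only repartition batches that the
// first aggregation pass actually shrank; emit unreduced batches as-is, since
// repartitioning them is unlikely to help.
def routeBatches(firstPass: Iterator[AggBatch],
    skipAggPassReductionRatio: Double): Iterator[AggBatch] =
  firstPass.map { b =>
    if (b.outputRows.toDouble / b.inputRows < skipAggPassReductionRatio) {
      repartitionAndMerge(b)
    } else {
      b // batch stayed large: skip repartitioning for this one
    }
  }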
