Make state spillable in partitioned writer [databricks] #8667

abellina · 2023-07-06T18:58:23Z

This change ensures that the partitioned writer queued batches are made spillable prior to releasing the semaphore.

I did change the ParquetWriterSuite such that exceptions being checked during writer failures look at the cause and find a SparkUpgradeException, instead of a SparkException to reflect the current state of some checks done in parquet specifically for date/time rebasing. I didn't include fixes for this, just made the tests test for the exception actually thrown.

I verified that this change allows the store_sales nds_transcode at 50GB to work with only 3GB of GPU memory, whereas I needed 12GB of memory before the change (failing with OOM at various places with less than 12GB due to all of the state kept around without the semaphore held). I used 16 shuffle partitions to make the write tasks process around 500MiB each. I used the following command and template to test, changing the JAR location (and the allocSize to test when the baseline would stop OOMing):

./spark-submit-template test.template nds_transcode.py file:///data/nds_50_csv file:///data/nds_50_parquet report.out --output_mode overwrite --tables store_sales

export JARS=rapids-4-spark_2.12-23.08.0-SNAPSHOT-cuda11.jar
export SPARK_CONF=("--master" "spark://127.0.0.1:7077"
                   "--conf" "spark.plugins=com.nvidia.spark.SQLPlugin"
                   "--conf" "spark.executor.extraClassPath=$JARS"
                   "--conf" "spark.driver.extraClassPath=$JARS"
                   "--conf" "spark.eventLog.enabled=true"
                   "--conf" "spark.sql.shuffle.partitions=16"
                   "--conf" "spark.rapids.memory.gpu.allocSize=3GB"
                   "--conf" "spark.rapids.sql.csv.read.decimal.enabled=true")

Signed-off-by: Alessandro Bellina <[email protected]>

abellina · 2023-07-06T19:11:39Z

build

...lake/common/src/main/scala/com/nvidia/spark/rapids/delta/GpuDeltaTaskStatisticsTracker.scala

sql-plugin/src/main/scala/com/nvidia/spark/rapids/ColumnarOutputWriter.scala

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileFormatDataWriter.scala

…path

abellina · 2023-07-06T22:15:42Z

build

abellina · 2023-07-06T22:45:02Z

build

abellina · 2023-07-07T03:56:37Z

build

abellina · 2023-07-07T04:17:58Z

build

abellina · 2023-07-17T20:43:34Z

Another follow on issue: #8736

abellina · 2023-07-17T22:16:08Z

@jlowe @revans2 this should be ready for another look. I added unit tests that cover the interaction with batches to write, spillables, and splits, but that do not validate the outputs. I wanted to get this reviewed and follow up with more testing either via integration tests or unit tests as follow on work. It would be nice to get runtime for this patch in general.

#8738 for follow on test work.

abellina · 2023-07-17T22:16:12Z

build

abellina · 2023-07-21T18:14:10Z

I am working on the failure. It looks like a failure to shutdown RMM causing failures in other tests. It seems related to mocking exceptions in the tests, so that looks to be a real issue. Will post once I have it

abellina · 2023-07-24T15:45:55Z

build

abellina · 2023-07-24T15:47:27Z

build

…into spillable_splits_part_writer

…itialized

abellina · 2023-07-25T19:23:49Z

build

abellina · 2023-07-25T20:03:05Z

I am having lots of issues with DeviceMemoryEventHandlerSuite and this build will fail. I cannot explain the issue yet, as the suite initializes all mocks and uses them. The test passes by itself, but not in combination of others, which is taking a very long time to go through. I'll push again.

abellina · 2023-07-25T20:23:54Z

Ok the reason why RmmSparkRetrySuiteBase was flaky was because it was failing if we had leftover RMM callbacks configured (singletons). It was also a bad test because it was succeeding for the wrong reasons after changes happened to the impl.

abellina · 2023-07-25T20:24:07Z

build

abellina · 2023-07-26T14:59:23Z

@revans2 this is ready for another look

revans2

Just a few nits

revans2 · 2023-07-26T15:08:21Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/ColumnarOutputWriter.scala

+  protected def getOutputStream: FSDataOutputStream = {
    val hadoopPath = new Path(path)
    val fs = hadoopPath.getFileSystem(conf)
    fs.create(hadoopPath, false)
  }
+
+  protected val outputStream: FSDataOutputStream = getOutputStream


nit: Could you add that as a comment to getOutputStream so we know why it is there?

revans2 · 2023-07-26T15:18:01Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/ColumnarOutputWriter.scala

+      spillableBatch: SpillableColumnarBatch,
+      statsTrackers: Seq[ColumnarWriteTaskStatsTracker]): Long = {
+    val writeStartTime = System.nanoTime
+    val cb = closeOnExcept(spillableBatch) { _ =>


nit: I think it would be a little cleaner to move cb to be inside of the closeOnExcept and not recreate it.

closeOnExcept(spillableBatch) { _ => val cb = withRetryNoSplit[ColumnarBatch] { spillableBatch.getColumnarBatch() } withResource(cb) { _ => throwIfRebase... } }

abellina · 2023-07-26T15:48:37Z

@revans2 thanks for the review, pushed a08e5a5 to handle both issues.

abellina · 2023-07-26T15:48:42Z

build

abellina added 2 commits July 6, 2023 13:55

Make state spillable in partitioned writer

13e3ab2

Signed-off-by: Alessandro Bellina <[email protected]>

Fix import order

e349864

revans2 reviewed Jul 6, 2023

View reviewed changes

abellina added 5 commits July 6, 2023 15:16

Move withRestoreOnRetry outside of bufferBatchAndClose only on retry …

dce0f19

…path

Remove extraneous withRetryNoSplit

6eeb1a0

Address review comments in GpuFileFormatDataWriter

7fd33d6

Add comment around the anythingWritten flag in ColumnarOutputWriter

e6a739b

Fix leak in my revisions

f73790e

abellina mentioned this pull request Jul 6, 2023

[FEA] Ensure no side effects in GpuDeltaTaskStatisticsTracker so we can retry stats calculation #8670

Closed

Add withRetryNoSplit in GpuDeltaTaskStatisticsTracker.newBatch

c056912

Fix import order

9825536

Do not close spillable in GpuDeltaTaskStatisticsTracker

a921bfd

Pass template argument to withRetryNoSplit

fbe0fb4

revans2 previously approved these changes Jul 7, 2023

View reviewed changes

abellina added 3 commits July 12, 2023 16:27

Upmerge

e5fe2e6

Fix upmerge issues

4922214

Add unit tests

ab4633f

abellina dismissed revans2’s stale review via ab4633f July 17, 2023 20:39

Import order

1420101

abellina added 3 commits July 17, 2023 15:49

Make sure to pass TaskContext

c989223

Remove debug statements

57bd89d

Remove more debug logic

963a99a

abellina added 3 commits July 21, 2023 16:09

Close spillable if we cant materialize the whole batch

1ade1f6

Close batches here for now in the test

3c903d3

We need to close batches in our mock to maintain expectations

8c57ff2

Batches are now closed correctly from mock

22bafa4

abellina added 7 commits July 24, 2023 12:26

Make sure that we close existing sessions in SparkQueryCompreTestSuite

8fe1d18

Handling of session cleanup is happening at superclass

f556fbc

Merge branch 'branch-23.08' of https://github.com/NVIDIA/spark-rapids …

fd96e54

…into spillable_splits_part_writer

Fixes unit test issues where the catalog/semaphore were being left in…

df425ce

…itialized

Remove extra line

f746816

Unused imports

3e0445b

Fix issues with DeviceMemoryEventHandlerSuite

87f22cc

Use RmmSparkRetrySuiteBase to reset rmm event handlers

54c83b3

revans2 previously approved these changes Jul 26, 2023

View reviewed changes

Apply code review comments

a08e5a5

abellina dismissed revans2’s stale review via a08e5a5 July 26, 2023 15:48

revans2 approved these changes Jul 26, 2023

View reviewed changes

abellina merged commit 5421a85 into NVIDIA:branch-23.08 Jul 26, 2023
27 checks passed

abellina deleted the spillable_splits_part_writer branch July 26, 2023 23:23

sameerz added the reliability Features to improve reliability or bugs that severly impact the reliability of the plugin label Jul 31, 2023

jlowe mentioned this pull request Sep 27, 2023

Fix RMM crash in FileCacheIntegrationSuite with ARENA memory allocator #9314

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make state spillable in partitioned writer [databricks] #8667

Make state spillable in partitioned writer [databricks] #8667

abellina commented Jul 6, 2023 •

edited

Loading

abellina commented Jul 6, 2023

abellina commented Jul 6, 2023

abellina commented Jul 6, 2023

abellina commented Jul 7, 2023

abellina commented Jul 7, 2023

abellina commented Jul 17, 2023

abellina commented Jul 17, 2023 •

edited

Loading

abellina commented Jul 17, 2023

abellina commented Jul 21, 2023

abellina commented Jul 24, 2023

abellina commented Jul 24, 2023

abellina commented Jul 25, 2023

abellina commented Jul 25, 2023 •

edited

Loading

abellina commented Jul 25, 2023

abellina commented Jul 25, 2023

abellina commented Jul 26, 2023

revans2 left a comment

revans2 Jul 26, 2023

revans2 Jul 26, 2023

abellina commented Jul 26, 2023

abellina commented Jul 26, 2023

Make state spillable in partitioned writer [databricks] #8667

Make state spillable in partitioned writer [databricks] #8667

Conversation

abellina commented Jul 6, 2023 • edited Loading

abellina commented Jul 6, 2023

abellina commented Jul 6, 2023

abellina commented Jul 6, 2023

abellina commented Jul 7, 2023

abellina commented Jul 7, 2023

abellina commented Jul 17, 2023

abellina commented Jul 17, 2023 • edited Loading

abellina commented Jul 17, 2023

abellina commented Jul 21, 2023

abellina commented Jul 24, 2023

abellina commented Jul 24, 2023

abellina commented Jul 25, 2023

abellina commented Jul 25, 2023 • edited Loading

abellina commented Jul 25, 2023

abellina commented Jul 25, 2023

abellina commented Jul 26, 2023

revans2 left a comment

Choose a reason for hiding this comment

revans2 Jul 26, 2023

Choose a reason for hiding this comment

revans2 Jul 26, 2023

Choose a reason for hiding this comment

abellina commented Jul 26, 2023

abellina commented Jul 26, 2023

abellina commented Jul 6, 2023 •

edited

Loading

abellina commented Jul 17, 2023 •

edited

Loading

abellina commented Jul 25, 2023 •

edited

Loading