[BUG] cudaErrorInvalidDevice in ParquetChunkedReader with a small number of executor cores #11565

Open
jihoonson opened this issue Oct 8, 2024 · 11 comments
Labels: bug (Something isn't working)

@jihoonson (Collaborator) commented Oct 8, 2024

Describe the bug

A simple join query fails with the error below.

24/10/07 17:06:59 WARN TaskSetManager: Lost task 0.0 in stage 31.0 (TID 12031) (10.110.47.50 executor 4): java.io.IOException: Error when processing path: file:///home/jihoons/data/tpcds/sf=10/parquet/store/part-00000-5b8288c9-c0f1-494a-9871-9d7fa7bd822e-c000.snappy.parquet, range: 0-17505, partition values: [empty row]
	at com.nvidia.spark.rapids.ParquetTableReader.$anonfun$next$1(GpuParquetScan.scala:2701)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.ParquetTableReader.next(GpuParquetScan.scala:2692)
	at com.nvidia.spark.rapids.ParquetTableReader.next(GpuParquetScan.scala:2664)
	at com.nvidia.spark.rapids.CachedGpuBatchIterator$.$anonfun$apply$1(GpuDataProducer.scala:159)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.CachedGpuBatchIterator$.apply(GpuDataProducer.scala:156)
	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$4(GpuMultiFileReader.scala:1065)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:477)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:613)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:291)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:132)
	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$1(GpuMultiFileReader.scala:1058)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.readBatch(GpuMultiFileReader.scala:1031)
	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1011)
	at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
	at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1(GpuDataSourceRDD.scala:74)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(GpuDataSourceRDD.scala:74)
	at scala.Option.exists(Option.scala:376)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:74)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.advanceToNextIter(GpuDataSourceRDD.scala:98)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:74)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:477)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1(GpuExec.scala:254)
	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1$adapted(GpuExec.scala:253)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.CollectTimeIterator.hasNext(GpuExec.scala:253)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.getHasOnDeck(GpuCoalesceBatches.scala:313)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:330)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
	at scala.collection.AbstractIterator.to(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
	at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1049)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2438)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: ai.rapids.cudf.CudfException: copy_if failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal
	at ai.rapids.cudf.ParquetChunkedReader.readChunk(Native Method)
	at ai.rapids.cudf.ParquetChunkedReader.readChunk(ParquetChunkedReader.java:170)
	at com.nvidia.spark.rapids.ParquetTableReader.$anonfun$next$1(GpuParquetScan.scala:2694)
	... 63 more

Steps/Code to reproduce bug

Here is the data and query I'm using.

  • Data: TPC-DS at sf=10
  • Query:
select s.* 
from store s, store_sales ss 
where s.s_store_sk <= ss.ss_store_sk and s.s_store_sk + 21 > ss.ss_store_sk;

Per my observation so far, this query fails when spark.executor.cores is set to less than 10; it works fine otherwise. When the executor core count was set to 10, 6 executors were created, since my machine has 64 CPU cores. When the executor core count was 9, 7 executors were created.
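
To make the arithmetic concrete, a rough sketch of the executor count on a standalone worker (illustrative only, not the Spark scheduler's actual code):

// Illustrative only: a standalone worker keeps launching executors while it has
// enough free cores, so the count is roughly floor(workerCores / executorCores).
val workerCores = 64
def executorsPerWorker(executorCores: Int): Int = workerCores / executorCores

println(executorsPerWorker(10)) // 6, matching the observation above
println(executorsPerWorker(9))  // 7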

Expected behavior
Ideally, the query should run successfully even with a small number of executor cores. Or, if this is an edge case in which the plugin cannot execute the query, it should fail with a more user-friendly error message.

Environment details (please complete the following information)

  • Environment location: single node standalone cluster. The node has 64 CPU cores.
  • Spark configuration settings related to the issue
SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=1 -Dspark.worker.resource.gpu.discoveryScript=/path/to/getGpusResources.sh"

export SPARK_CONF=("--master" "spark://my-master:7077"
                   "--conf" "spark.driver.maxResultSize=2GB"
                   "--conf" "spark.driver.memory=50G"
                   "--conf" "spark.executor.cores=9" <== setting this to 10 or a greater can avoid the error
                   "--conf" "spark.executor.instances=1" <== this did not take effect
                   "--conf" "spark.executor.memory=16G"
                   "--conf" "spark.sql.adaptive.enabled=true"
                   "--conf" "spark.plugins=com.nvidia.spark.SQLPlugin"
                   "--conf" "spark.rapids.memory.host.spillStorageSize=1MB"
                   "--conf" "spark.rapids.memory.pinnedPool.size=8g"
                   "--conf" "spark.rapids.sql.concurrentGpuTasks=4"
                   "--conf" "spark.rapids.sql.explain=all"
                   "--conf" "spark.eventLog.enabled=true"
                   "--conf" "spark.eventLog.dir=/tmp/spark-events"
                   "--conf" "spark.driver.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR"
                   "--conf" "spark.executor.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR"
                   "--conf" "spark.executor.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true"
                   "--conf" "spark.rapids.sql.batchSizeBytes=256MB"
                   "--conf" "spark.files.maxPartitionBytes=268435456"
                   "--conf" "spark.rapids.memory.gpu.allocSize=3758096384"
                   "--conf" "spark.rapids.sql.metrics.level=DEBUG")
jihoonson added the "? - Needs Triage" and "bug" labels on Oct 8, 2024
@jihoonson (Collaborator, Author) commented:

I am aware of #11215 and rapidsai/cudf#16186, but not entirely sure if these are the same issue.

@mattahrens (Collaborator) commented:

What version of the plugin jar are you running with?

mattahrens removed the "? - Needs Triage" label on Oct 8, 2024
@jihoonson (Collaborator, Author) commented:

@mattahrens I've been using a branch in my repo which is based on branch-24.10. I will try with the latest branch and see if the problem still remains.

@jihoonson (Collaborator, Author) commented:

Just tested. branch-24.12 has the same issue.

@jihoonson (Collaborator, Author) commented:

It turns out that the issue is not about the GPU device setting (thanks to @abellina for help with debugging). It is about memory, specifically the so-called "reserve" memory. Because the GPU allocation size is fixed in my setting, the total amount of GPU memory used by all executors grows as more executors are created. The amount assigned to each executor was 3.5G, and my machine has 24G of total GPU memory. When the executor core count was set to 9, 7 executors were created (because my machine has 64 CPU cores), which would have left little or no memory for GPU kernel execution, if it was not exhausted outright. This likely caused a kernel failure. So the failure itself is legitimate, since we cannot proceed with query processing in this case, but the error message should be improved.
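
To make the memory math concrete, a rough sketch using the numbers above (illustrative arithmetic only):

// Illustrative arithmetic using the numbers reported in this issue.
val totalGpuMemGiB = 24.0  // total GPU memory on the machine
val allocSizeGiB   = 3.5   // spark.rapids.memory.gpu.allocSize = 3758096384 bytes
println(totalGpuMemGiB - 6 * allocSizeGiB) // 3.0 GiB headroom with 6 executors (10 cores each)
println(totalGpuMemGiB - 7 * allocSizeGiB) // -0.5 GiB with 7 executors (9 cores each): over-committed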

@revans2 (Collaborator) commented Oct 10, 2024

@jihoonson Are you running multiple spark processes sharing a single GPU? Was this on purpose?

@jihoonson (Collaborator, Author) commented Oct 11, 2024

@revans2 yes I was, but it was not on purpose; it happened accidentally. I wanted to limit the executor process count to 1, but spark.executor.instances=1 did not work as I expected. I did not realize I had multiple executor processes running until I looked at the logs more closely to debug this invalid device error.
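
As a hedged aside: spark.executor.instances is generally honored by YARN/Kubernetes rather than the standalone master, and capping spark.cores.max is one possible way to bound the executor count on standalone. A minimal sketch, assuming a single 64-core worker:

// Hedged sketch (not from this issue): on standalone, capping the total cores an
// application may take is one way to end up with a single 9-core executor.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.cores", "9")
  .set("spark.cores.max", "9") // assumption: caps the app at one executor on this worker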

@revans2 (Collaborator) commented Oct 14, 2024

@abellina and @jlowe, do we want to update GpuDeviceManager so that spark.rapids.memory.gpu.allocSize still honors things like the min/max fraction? I am conflicted here, because allocSize is an internal-only config: there were no checks, and it was intended for internal testing only. But if devs are running into issues with it, should we try to find ways to avoid the foot-gun problem?
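
A minimal sketch of what honoring the fraction bounds could look like, using hypothetical names rather than the actual GpuDeviceManager code (it also marks where a loud warning could go instead of a clamp):

// Hedged sketch only; clampAllocSize and its inputs are hypothetical stand-ins,
// not the actual GpuDeviceManager code.
def clampAllocSize(requested: Long, totalGpuMem: Long,
                   minFraction: Double, maxFraction: Double, reserve: Long): Long = {
  val upper = math.min((totalGpuMem * maxFraction).toLong, totalGpuMem - reserve)
  val lower = (totalGpuMem * minFraction).toLong
  val clamped = math.max(lower, math.min(requested, upper))
  if (clamped != requested) {
    // Alternatively, warn loudly here (per the next comment) instead of silently clamping.
    System.err.println(s"allocSize $requested is outside [$lower, $upper]; using $clamped")
  }
  clamped
}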

@abellina (Collaborator) commented Oct 14, 2024

With reserve in general, I'd be interested to know if there is a way to make the error more user friendly, as that is the part that was really confusing here.

For allocSize, I think we could change it so that it warns loudly (on the executor; it can't be on the driver): "it looks like you went above reserve!" or something like that?

@jlowe (Member) commented Oct 14, 2024

I'm personally OK with a spark.rapids.memory.gpu.allocSize setting causing a crash if it goes above the max fraction (or otherwise conflicts with other settings). Essentially, you would have to increase the max alloc fraction and/or reduce the reserve amount to allocate beyond the normal upper limit, even with an explicit allocSize setting.
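
A minimal sketch of this fail-fast variant, again with hypothetical names rather than the plugin's real validation:

// Hedged sketch only; the method name is hypothetical, not the plugin's real check.
def validateAllocSize(requested: Long, totalGpuMem: Long,
                      maxFraction: Double, reserve: Long): Unit = {
  val upper = math.min((totalGpuMem * maxFraction).toLong, totalGpuMem - reserve)
  require(requested <= upper,
    s"spark.rapids.memory.gpu.allocSize=$requested exceeds the allowed maximum $upper; " +
    "increase the max alloc fraction and/or reduce the reserve to allocate beyond this limit")
}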

@revans2 (Collaborator) commented Oct 14, 2024

The other problem is that there is a race condition on startup. We check the reserve based on the memory that is currently free on the GPU, and we don't check it again after the pool has been initialized, so if there are multiple tasks trying to grab memory at the same time we might get into trouble detecting reserved memory. That said, checks that are racy are better than no checks at all.

Another thing that concerns me is that I have seen the async allocator treat the pool size as a suggestion rather than a hard limit. We limit the total number of bytes that can be allocated, but the async pool is not the one doing the limiting. If there are a lot of threads/streams, fragmentation between the sub-pools used for each stream can make the pool grow beyond the limit we set. This might be the cause of a similar failure we saw in a customer's query; it might be that we actually did run out of GPU memory. Not sure though.
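
For illustration of the startup race described above, a hedged sketch, assuming ai.rapids.cudf.Cuda.memGetInfo() exposes the free memory; the surrounding logic is hypothetical:

// Hedged sketch of the startup race: each executor reads free memory before any
// pool exists, concludes the reserve is satisfied, then both carve out their pools.
// Assumes ai.rapids.cudf.Cuda.memGetInfo() returns a struct with a `free` field.
import ai.rapids.cudf.Cuda

def reserveLooksSatisfied(poolSize: Long, reserve: Long): Boolean = {
  val free = Cuda.memGetInfo().free // read before this process initializes its pool
  free - poolSize >= reserve        // another process may allocate right after this check
}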
