[BUG] Timestamp queries with Iceberg throw ClassCastException #511

Open

engechas opened this issue Aug 2, 2024 · 5 comments

Labels: bug (Something isn't working), DataSource:Iceberg

Comments

engechas (Contributor) commented Aug 2, 2024

What is the bug?
When running certain queries that involve timestamp fields against Iceberg tables, an exception is thrown during query execution:

24/07/26 20:16:19 ERROR Executor: Exception in task 4.3 in stage 0.0 (TID 17)
java.lang.ClassCastException: class org.apache.iceberg.shaded.org.apache.arrow.vector.TimeStampMicroVector cannot be cast to class org.apache.iceberg.shaded.org.apache.arrow.vector.BigIntVector (org.apache.iceberg.shaded.org.apache.arrow.vector.TimeStampMicroVector and org.apache.iceberg.shaded.org.apache.arrow.vector.BigIntVector are in unnamed module of loader 'app')
at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateVectorBasedOnOriginalType(VectorizedArrowReader.java:273) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateFieldVector(VectorizedArrowReader.java:218) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:132) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader$ColumnBatchLoader.readDataToColumnVectors(ColumnarBatchReader.java:123) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader$ColumnBatchLoader.loadDataToColumnBatch(ColumnarBatchReader.java:98) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:72) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:44) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:147) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:136) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at scala.Option.exists(Option.scala:376) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:97) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:955) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:142) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:840) ~[?:?]

More info:

  • The data type of the time_dt field (the field that causes the exception) in the table schema is timestamp.
  • Iceberg defines timestamps at microsecond granularity: https://iceberg.apache.org/spec/#primitive-types
  • The underlying data in the table for the time_dt field is stored at millisecond granularity.

The exception comes from here: https://github.com/apache/iceberg/blob/1.2.x/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java#L273

It looks like the TimeStampMicroVector is coming from here: https://github.com/apache/iceberg/blob/main/arrow/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java#L103-L107
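
For illustration, the failing pattern can be shown with stock Arrow classes alone. This is a minimal standalone sketch, not Flint or Iceberg code, and it assumes only the org.apache.arrow:arrow-vector dependency: ArrowSchemaUtil maps an Iceberg timestamp to an Arrow timestamp type, so the allocated vector is a TimeStampMicroVector, while the reader then downcasts it to BigIntVector:

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.TimeStampMicroVector;

public class TimestampCastSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator()) {
      // What ArrowSchemaUtil effectively produces for an Iceberg
      // timestamp field: a microsecond-granularity timestamp vector.
      FieldVector vector = new TimeStampMicroVector("time_dt", allocator);

      // What the reader expects for TIMESTAMP_MICROS: a BigIntVector.
      // The downcast compiles but fails at runtime with the same
      // ClassCastException seen in the stack trace above.
      BigIntVector asBigInt = (BigIntVector) vector;
      asBigInt.allocateNew();
    }
  }
}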

How can one reproduce the bug?
Steps to reproduce the behavior:
The exact mechanism to reproduce this is unknown. The following query causes the exception:

SELECT accountid, region, count(*) as total FROM <table> WHERE accountid in ('<redacted>') AND region = 'us-east-1' AND time_dt BETWEEN CURRENT_TIMESTAMP - INTERVAL '1' MONTH AND CURRENT_TIMESTAMP GROUP BY accountid, region ORDER BY total DESC
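
Since the exact trigger is unknown, a local reproduction attempt could start from a setup like the sketch below. The catalog name, warehouse path, table name, and sample row are hypothetical; only time_dt and its timestamp type come from the report, and the matching iceberg-spark-runtime jar must be on the classpath. It may not fail on every dataset, given that the report notes the underlying files store millisecond-granularity values:

import org.apache.spark.sql.SparkSession;

public class IcebergTimestampRepro {
  public static void main(String[] args) {
    // Hypothetical local Hadoop catalog named "local"; these are the
    // standard Iceberg-on-Spark settings, not values from the report.
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.local.type", "hadoop")
        .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate();

    spark.sql("CREATE NAMESPACE IF NOT EXISTS local.db");
    spark.sql("CREATE TABLE IF NOT EXISTS local.db.events ("
        + "accountid STRING, region STRING, time_dt TIMESTAMP) USING iceberg");
    spark.sql("INSERT INTO local.db.events VALUES "
        + "('123456789012', 'us-east-1', current_timestamp())");

    // Same shape as the failing query from the report.
    spark.sql("SELECT accountid, region, count(*) AS total FROM local.db.events "
        + "WHERE region = 'us-east-1' "
        + "AND time_dt BETWEEN CURRENT_TIMESTAMP - INTERVAL '1' MONTH "
        + "AND CURRENT_TIMESTAMP "
        + "GROUP BY accountid, region ORDER BY total DESC").show();

    spark.stop();
  }
}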

What is the expected behavior?
The query should execute successfully instead of throwing a ClassCastException.

What is your host/environment?

  • Version: 0.4.0

engechas added the bug (Something isn't working) and untriaged labels Aug 2, 2024
dai-chen removed the untriaged label Aug 9, 2024
dai-chen (Collaborator) commented Aug 9, 2024

Just trying to understand: is this a bug in the Spark Iceberg reader itself?

engechas (Contributor, Author) commented

Yes, it looks like a bug in the Spark Iceberg reader.

dai-chen (Collaborator) commented

> Yes, it looks like a bug in the Spark Iceberg reader.

Thanks for confirming! If possible, could you test it with Spark 3.5? We've bumped the version and are planning to release 0.5 soon.

engechas (Contributor, Author) commented

Peng encountered this in some of his testing with EMR 7.2/Spark 3.5, so unfortunately it doesn't look like the version bump will fix it.

anirudha (Collaborator) commented

What's the path ahead here?
