[BUG] Timestamp queries with Iceberg throw ClassCastException #511

Open

engechas opened this issue Aug 2, 2024 · 5 comments

Labels: bug (Something isn't working), DataSource:Iceberg

Comments

engechas (Contributor) commented Aug 2, 2024

What is the bug?
When running certain queries that involve timestamp fields against Iceberg tables, an exception is thrown during query execution:

24/07/26 20:16:19 ERROR Executor: Exception in task 4.3 in stage 0.0 (TID 17)
java.lang.ClassCastException: class org.apache.iceberg.shaded.org.apache.arrow.vector.TimeStampMicroVector cannot be cast to class org.apache.iceberg.shaded.org.apache.arrow.vector.BigIntVector (org.apache.iceberg.shaded.org.apache.arrow.vector.TimeStampMicroVector and org.apache.iceberg.shaded.org.apache.arrow.vector.BigIntVector are in unnamed module of loader 'app')
at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateVectorBasedOnOriginalType(VectorizedArrowReader.java:273) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateFieldVector(VectorizedArrowReader.java:218) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:132) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader$ColumnBatchLoader.readDataToColumnVectors(ColumnarBatchReader.java:123) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader$ColumnBatchLoader.loadDataToColumnBatch(ColumnarBatchReader.java:98) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:72) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:44) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:147) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:136) ~[iceberg-spark-runtime-3.3_2.12-1.2.0-amzn-0.jar:?]
at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at scala.Option.exists(Option.scala:376) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:97) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:955) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:142) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:840) ~[?:?]

More info:

  • The data type of the time_dt field (the field that causes the exception) in the table schema is timestamp.
  • Iceberg defines timestamps at microsecond granularity: https://iceberg.apache.org/spec/#primitive-types
  • The underlying data in the table for the time_dt field is stored at millisecond granularity.

The exception comes from here: https://github.com/apache/iceberg/blob/1.2.x/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java#L273

It looks like the TimeStampMicroVector is coming from here: https://github.com/apache/iceberg/blob/main/arrow/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java#L103-L107
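
For illustration, the failing pattern can be shown with stock Arrow classes alone. This is a minimal standalone sketch, not Flint or Iceberg code, and it assumes only the org.apache.arrow:arrow-vector dependency: ArrowSchemaUtil maps an Iceberg timestamp to an Arrow timestamp type, so the allocated vector is a TimeStampMicroVector, while the reader then downcasts it to BigIntVector:

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.TimeStampMicroVector;

public class TimestampCastSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator()) {
      // What ArrowSchemaUtil effectively produces for an Iceberg
      // timestamp field: a microsecond-granularity timestamp vector.
      FieldVector vector = new TimeStampMicroVector("time_dt", allocator);

      // What the reader expects for TIMESTAMP_MICROS: a BigIntVector.
      // The downcast compiles but fails at runtime with the same
      // ClassCastException seen in the stack trace above.
      BigIntVector asBigInt = (BigIntVector) vector;
      asBigInt.allocateNew();
    }
  }
}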

How can one reproduce the bug?
Steps to reproduce the behavior:
The exact mechanism to reproduce this is unknown. The following query causes the exception:

SELECT accountid, region, count(*) as total FROM <table> WHERE accountid in ('<redacted>') AND region = 'us-east-1' AND time_dt BETWEEN CURRENT_TIMESTAMP - INTERVAL '1' MONTH AND CURRENT_TIMESTAMP GROUP BY accountid, region ORDER BY total DESC
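
Since the exact trigger is unknown, a local reproduction attempt could start from a setup like the sketch below. The catalog name, warehouse path, table name, and sample row are hypothetical; only time_dt and its timestamp type come from the report, and the matching iceberg-spark-runtime jar must be on the classpath. It may not fail on every dataset, given that the report notes the underlying files store millisecond-granularity values:

import org.apache.spark.sql.SparkSession;

public class IcebergTimestampRepro {
  public static void main(String[] args) {
    // Hypothetical local Hadoop catalog named "local"; these are the
    // standard Iceberg-on-Spark settings, not values from the report.
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.local.type", "hadoop")
        .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate();

    spark.sql("CREATE NAMESPACE IF NOT EXISTS local.db");
    spark.sql("CREATE TABLE IF NOT EXISTS local.db.events ("
        + "accountid STRING, region STRING, time_dt TIMESTAMP) USING iceberg");
    spark.sql("INSERT INTO local.db.events VALUES "
        + "('123456789012', 'us-east-1', current_timestamp())");

    // Same shape as the failing query from the report.
    spark.sql("SELECT accountid, region, count(*) AS total FROM local.db.events "
        + "WHERE region = 'us-east-1' "
        + "AND time_dt BETWEEN CURRENT_TIMESTAMP - INTERVAL '1' MONTH "
        + "AND CURRENT_TIMESTAMP "
        + "GROUP BY accountid, region ORDER BY total DESC").show();

    spark.stop();
  }
}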

What is the expected behavior?
The query should execute successfully instead of throwing a ClassCastException.

What is your host/environment?

  • Version: 0.4.0

engechas added the bug (Something isn't working) and untriaged labels Aug 2, 2024
dai-chen removed the untriaged label Aug 9, 2024
dai-chen (Collaborator) commented Aug 9, 2024

Just trying to understand: is this a bug in the Spark Iceberg reader itself?

engechas (Contributor, Author) commented

Yes, it looks like a bug in the Spark Iceberg reader.

dai-chen (Collaborator) commented

> Yes, it looks like a bug in the Spark Iceberg reader.

Thanks for confirming! If possible, could you test it with Spark 3.5? We've bumped the version and are planning to release 0.5 soon.

engechas (Contributor, Author) commented

Peng encountered this in some of his testing with EMR 7.2/Spark 3.5, so unfortunately it doesn't look like the version bump will fix it.

anirudha (Collaborator) commented

What's the path ahead here?
