Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Required field 'num_values' was not found in serialized data! #3084

Open
wardlican opened this issue Nov 28, 2024 · 1 comment
Open

Required field 'num_values' was not found in serialized data! #3084

wardlican opened this issue Nov 28, 2024 · 1 comment

Comments

@wardlican
Copy link

wardlican commented Nov 28, 2024

Describe the bug, including details regarding any error messages, version, and platform.

When using iceberg, we encountered a situation where a parquet file we wrote could not be read. When reading, the following error message appeared. Judging from the exception information, it is speculated that the parquet file is damaged or has not been written properly and cannot be parsed. We have also tried a variety of parsing tools but cannot parse it normally. However, the footer of the file is normal and the schema information of the file can be obtained, but the read data cannot be parsed the DataPageHeader. parquet version is 1.13.1. Is there any tool that can restore damaged files?

 org.apache.iceberg.exceptions.RuntimeIOException: java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data! Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:165)
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141)
	at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:130)
	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:93)
	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:130)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.agg_doAggregateWithKeys_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1501)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data! Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.read(Util.java:366)
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.readPageHeader(Util.java:133)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1458)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1505)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1478)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1088)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:956)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:909)
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163)
	... 23 more
Caused by: org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.protocol.TProtocolException: Required field 'num_values' was not found in serialized data! Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
	at org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme.read(DataPageHeader.java:781)
	at org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme.read(DataPageHeader.java:719)
	at org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader.read(DataPageHeader.java:624)
	at org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1072)
	at org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1025)
	at org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader.read(PageHeader.java:902)
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.read(Util.java:363)
	... 31 more

Component(s)

Thrift

@wgtmac
Copy link
Member

wgtmac commented Nov 28, 2024

From the error message, it seems that the thrift deserialization uses a somewhat incompatible thrift protocol to deserialize parquet metadata. If the thrift protocol used to write the parquet file does not have a required num_values field in the DataPageHeader, or the thrift serializer has skipped writing 0 to that field, the deserializer will complain that message. Please check the writer version (from footer metadata of the malformed file) and try to use different parquet readers to read, though I don't know whether it works or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants