Parquet: Add readers and writers for the internal object model #11904

Open · wants to merge 4 commits into main from parquet_internal_writer

Conversation

ajantha-bhat
Member

Split into 3 commits:

a) Refactor BaseParquetWriter to keep only the common functionality required by the internal and generic writers.
b) Refactor BaseParquetReaders to keep only the common functionality required by the internal and generic readers.
c) Add an internal writer and reader that consume and produce the Iceberg in-memory data model.
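For context, a rough usage sketch of the new pair. The factory names InternalWriter.buildWriter and InternalReader.buildReader are assumptions mirroring GenericParquetWriter.buildWriter and GenericParquetReaders.buildReader; the PR defines the actual entry points:

  // Hypothetical usage; assumes factory methods mirroring the generic classes.
  FileAppender<StructLike> appender =
      Parquet.write(outputFile)              // outputFile: an org.apache.iceberg.io.OutputFile
          .schema(schema)                    // schema: the Iceberg org.apache.iceberg.Schema
          .createWriterFunc(messageType -> InternalWriter.buildWriter(messageType))
          .build();

  CloseableIterable<StructLike> records =
      Parquet.read(inputFile)
          .project(schema)
          .createReaderFunc(messageType -> InternalReader.buildReader(schema, messageType))
          .build();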

@ajantha-bhat marked this pull request as draft on January 3, 2025 16:27
@ajantha-bhat requested a review from rdblue on January 3, 2025 16:30
@ajantha-bhat reopened this on Jan 4, 2025
@ajantha-bhat force-pushed the parquet_internal_writer branch from 772f5c2 to 233a00b on January 6, 2025 11:33
- code: "java.method.abstractMethodAdded"
new: "method org.apache.iceberg.parquet.ParquetValueReaders.PrimitiveReader<?>\
\ org.apache.iceberg.data.parquet.BaseParquetReaders<T>::fixedReader(org.apache.parquet.column.ColumnDescriptor)"
justification: "{Refactor Parquet reader and writer}"
Contributor

Why are there curly braces in the justification text?

@rdblue changed the title from "Parquet: Internal writer and reader" to "Parquet: Add readers and writers for the internal object model" on Jan 7, 2025

@Override
public UUID read(UUID reuse) {
return UUIDUtil.convert(column.nextBinary().toByteBuffer());
Contributor

This looks fine to me.

}

private static class LogicalTypeWriterVisitor
implements LogicalTypeAnnotation.LogicalTypeAnnotationVisitor<
Contributor

I think it would be better to import LogicalTypeAnnotation.LogicalTypeAnnotationVisitor and ParquetValueWriters.PrimitiveWriter to fix the formatting here. I know it matches what was copied, but that had been auto-formatted when the project moved to Google style.
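For illustration, with those imports the declaration would shrink to something like:

  import org.apache.iceberg.parquet.ParquetValueWriters.PrimitiveWriter;
  import org.apache.parquet.schema.LogicalTypeAnnotation.LogicalTypeAnnotationVisitor;

  private static class LogicalTypeWriterVisitor
      implements LogicalTypeAnnotationVisitor<PrimitiveWriter<?>> {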

return ParquetValueWriters.byteBuffers(desc);
}

private static class ParquetStructWriter extends StructWriter<StructLike> {
Contributor

I think this should be StructLikeWriter. This is in the parquet package, so there isn't much value in adding "Parquet" to the name.

@Override
protected ParquetValueWriters.PrimitiveWriter<?> fixedWriter(ColumnDescriptor desc) {
// accepts ByteBuffer and internally writes as binary.
return ParquetValueWriters.byteBuffers(desc);
Contributor

Make sure this writer checks the length of the incoming bytes.

Member Author

OK. The existing GenericParquetWriter code also lacks this length check when writing byte[]; I will add that check.


@Override
protected ParquetValueWriters.PrimitiveWriter<?> fixedWriter(ColumnDescriptor desc) {
// accepts ByteBuffer and internally writes as binary.
Contributor

I don't think this comment is very helpful. Probably remove it.

return new ParquetValueReaders.UnboxedReader<>(desc);
}

private static class ParquetStructReader extends StructReader<StructLike, StructLike> {
Contributor

Here also, there's not much value in using Parquet in the class name. Since this will produce GenericRecord instances, how about RecordReader?

Contributor

When checking that name (RecordReader) for consistency, I noticed that there's already a RecordReader in GenericParquetReaders. You can reuse that class.

Member Author (@ajantha-bhat, Jan 9, 2025)

We cannot reuse the class from GenericParquetReaders because it is based on the Record interface; we need a class based on the StructLike interface.

I will rename it to StructLikeReader, matching the StructLikeWriter in the InternalWriter class.

@Override
protected ParquetValueReaders.PrimitiveReader<?> int96Reader(ColumnDescriptor desc) {
// normal handling as int96
return new ParquetValueReaders.UnboxedReader<>(desc);
Contributor

This isn't correct. The unboxed reader will return a Binary for int96 columns. Instead, this needs to use the same logic as the Spark reader (which also uses the internal representation):

  private static class TimestampInt96Reader extends UnboxedReader<Long> {
    TimestampInt96Reader(ColumnDescriptor desc) {
      super(desc);
    }

    @Override
    public Long read(Long ignored) {
      return readLong();
    }

    @Override
    public long readLong() {
      final ByteBuffer byteBuffer =
          column.nextBinary().toByteBuffer().order(ByteOrder.LITTLE_ENDIAN);
      return ParquetUtil.extractTimestampInt96(byteBuffer);
    }
  }

You can move that class into the parquet package to share it.

@@ -359,10 +250,10 @@ public ParquetValueReader<?> primitive(

ColumnDescriptor desc = type.getColumnDescription(currentPath());

-    if (primitive.getOriginalType() != null) {
+    if (primitive.getLogicalTypeAnnotation() != null) {
Contributor

I agree with this change, but please point these kinds of changes out for reviewers.

The old version worked because all of the supported logical type annotations had an equivalent ConvertedType (which is what OriginalType is called in Parquet format and the logical type docs).

@Override
public Optional<ParquetValueReader<?>> visit(
LogicalTypeAnnotation.DateLogicalTypeAnnotation dateLogicalType) {
return Optional.of(new DateReader(desc));
Contributor

This and the following 2 methods are the only changes between the implementations of this class, so a lot of code is duplicated. In addition, this already introduces abstract factory methods for some readers -- including timestamps. I think it would be much cleaner to reuse this and call factory methods instead:

  protected abstract PrimitiveReader<?> dateReader(ColumnDescriptor desc);
  protected abstract PrimitiveReader<?> timeReader(ColumnDescriptor desc, ChronoUnit unit);
  protected abstract PrimitiveReader<?> timestampReader(ColumnDescriptor desc, ChronoUnit unit, boolean isAdjustedToUTC);
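A sketch of how the shared visitor could then dispatch to those hooks (the visit signatures follow Parquet's LogicalTypeAnnotationVisitor; the factory calls are the proposal above, not existing API):

  @Override
  public Optional<ParquetValueReader<?>> visit(
      LogicalTypeAnnotation.DateLogicalTypeAnnotation dateLogicalType) {
    return Optional.of(dateReader(desc));
  }

  @Override
  public Optional<ParquetValueReader<?>> visit(
      LogicalTypeAnnotation.TimestampLogicalTypeAnnotation timestampLogicalType) {
    // Parquet's TimeUnit constants (MILLIS/MICROS/NANOS) share names with ChronoUnit
    return Optional.of(
        timestampReader(
            desc,
            ChronoUnit.valueOf(timestampLogicalType.getUnit().name()),
            timestampLogicalType.isAdjustedToUTC()));
  }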

@@ -76,6 +64,16 @@ protected ParquetValueReader<T> createReader(
protected abstract ParquetValueReader<T> createStructReader(
List<Type> types, List<ParquetValueReader<?>> fieldReaders, Types.StructType structType);

protected abstract LogicalTypeAnnotation.LogicalTypeAnnotationVisitor<ParquetValueReader<?>>
Contributor

I don't think it makes sense to have the subclasses provide this visitor.

private static final OffsetDateTime EPOCH = Instant.ofEpochSecond(0).atOffset(ZoneOffset.UTC);
private static final LocalDate EPOCH_DAY = EPOCH.toLocalDate();

private static class DateReader extends ParquetValueReaders.PrimitiveReader<LocalDate> {
Contributor

I agree with moving the date/time reader classes here.

@Override
public Optional<ParquetValueReader<?>> visit(
LogicalTypeAnnotation.TimestampLogicalTypeAnnotation timestampLogicalType) {
return Optional.of(new ParquetValueReaders.UnboxedReader<>(desc));
Contributor

This isn't correct. The unit of the incoming timestamp value still needs to be handled, even if the in-memory representation of the value is the same (a long).

Contributor

Looks like the Spark implementations for this should work well, just like the int96 cases.
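For reference, the Spark-style handling converts millisecond timestamps to the internal microsecond representation on read. A sketch modeled on Spark's TimestampMillisReader (assumed here for illustration, not part of this diff):

  private static class TimestampMillisToMicrosReader extends UnboxedReader<Long> {
    TimestampMillisToMicrosReader(ColumnDescriptor desc) {
      super(desc);
    }

    @Override
    public Long read(Long ignored) {
      return readLong();
    }

    @Override
    public long readLong() {
      // timestamp(millis) annotates int64; scale to micros for the internal model
      return 1000L * column.nextLong();
    }
  }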

@Override
public Optional<ParquetValueReader<?>> visit(
LogicalTypeAnnotation.TimeLogicalTypeAnnotation timeLogicalType) {
return Optional.of(new ParquetValueReaders.UnboxedReader<>(desc));
Contributor

This isn't correct. Like timestamp, this needs to handle the unit of the incoming value. In addition, millisecond values must annotate an int32 according to the Parquet logical type docs. When the unit is a millisecond value, this needs to call readInt and multiply by 1000.

It looks like Spark currently gets the underlying Parquet type for milliseconds wrong (which makes sense because this is never used in Spark). We can go ahead and fix this now and share the reader between Internal and Spark.

  private static class TimeMillisReader extends UnboxedReader<Long> {
    TimeMillisReader(ColumnDescriptor desc) {
      super(desc);
    }

    @Override
    public Long read(Long ignored) {
      return readLong();
    }

    @Override
    public long readLong() {
      return 1000L * column.nextInteger();
    }
  }


@Override
protected ParquetValueWriters.PrimitiveWriter<?> fixedWriter(ColumnDescriptor desc) {
// accepts byte[] and internally writes as binary.
Contributor

Nit: unhelpful comment.

protected LogicalTypeAnnotation.LogicalTypeAnnotationVisitor<
ParquetValueWriters.PrimitiveWriter<?>>
logicalTypeWriterVisitor(ColumnDescriptor desc) {
return new LogicalTypeWriterVisitor(desc);
Contributor

Here, I would also prefer not to have subclasses provide the visitor.

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

public class TestInternalWriter {
Contributor

As with the Avro tests, I think this should extend DataTest. It is probably easier to do the Avro work first and then reuse it here.

\ org.apache.iceberg.data.parquet.BaseParquetReaders<T>::logicalTypeReaderVisitor(org.apache.parquet.column.ColumnDescriptor,\
\ org.apache.iceberg.types.Type.PrimitiveType, org.apache.parquet.schema.PrimitiveType)"
justification: "{Refactor Parquet reader and writer}"
- code: "java.method.abstractMethodAdded"
Contributor

This PR should not introduce revapi failures. Instead, the new methods should have default implementations that match the previous behavior (returning the generic representations).

Member Author

The new methods are abstract, and abstract methods cannot have default implementations. So I think we have to handle the revapi failures.

Member Author

Oh, I think what you mean is: don't add them as abstract methods, add them as methods with default implementations. Got it. I will update it today.
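A sketch of the non-breaking shape for one of the flagged methods (the body is illustrative; FixedReader stands in for whatever reader the generic path used before):

  // Default implementation preserving the previous generic behavior, so
  // existing subclasses keep compiling and revapi stays clean.
  protected ParquetValueReaders.PrimitiveReader<?> fixedReader(ColumnDescriptor desc) {
    return new FixedReader(desc);
  }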

@ajantha-bhat force-pushed the parquet_internal_writer branch from c977d2a to 8a33e15 on January 11, 2025 00:08
@github-actions bot removed the API label on Jan 11, 2025
@ajantha-bhat force-pushed the parquet_internal_writer branch from 8a33e15 to dae6c77 on January 11, 2025 08:02
@github-actions bot added the API label on Jan 11, 2025
@@ -237,7 +237,7 @@ private static BigInteger randomUnscaled(int precision, Random random) {
}

public static List<Object> generateList(
Random random, Types.ListType list, Supplier<Object> elementResult) {
Member Author

Addressing the nit from #11919

import org.apache.iceberg.parquet.Parquet;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;

public class TestInternalData extends DataTest {
Member Author

Kept this class in the data module instead of parquet because the parquet module doesn't have DataTest. Having it in the parquet module would mean a lot of code duplication.

}

@Override
public Optional<ParquetValueWriters.PrimitiveWriter<?>> visit(
LogicalTypeAnnotation.TimeLogicalTypeAnnotation timeType) {
-        return Optional.of(new TimeWriter(desc));
+        Preconditions.checkArgument(
Member Author

Added this check because, for timestamp, there was already a check to process only MICROS.
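Roughly, the added check looks like this (a sketch; the message wording is an assumption):

  Preconditions.checkArgument(
      LogicalTypeAnnotation.TimeUnit.MICROS.equals(timeType.getUnit()),
      "Cannot write time in %s, only MICROS is supported", timeType.getUnit());
  return Optional.of(new TimeWriter(desc));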

-      } else {
-        return Optional.of(new TimestampWriter(desc));
-      }
+      return timestampWriter(desc, timestampType.isAdjustedToUTC());
Member Author

Should I pass a TimeUnit to avoid modifying the signature in the future when other units are supported? Same question for timeType.

}
}

private static class FixedWriter extends PrimitiveWriter<byte[]> {
Member Author

I just moved this class from BaseParquetWriter; it was missing length validation, so I added it.
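A sketch of the moved class with the added check (the field name and message text are assumptions):

  private static class FixedWriter extends PrimitiveWriter<byte[]> {
    private final int length;

    private FixedWriter(ColumnDescriptor desc) {
      super(desc);
      this.length = desc.getPrimitiveType().getTypeLength();
    }

    @Override
    public void write(int repetitionLevel, byte[] value) {
      // reject arrays that don't match the declared fixed length
      Preconditions.checkArgument(
          value.length == length,
          "Cannot write byte array of length %s as fixed[%s]", value.length, length);
      column.writeBinary(repetitionLevel, Binary.fromReusedByteArray(value));
    }
  }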

@@ -373,41 +371,6 @@ public Decimal read(Decimal ignored) {
}
}

private static class TimestampMillisReader extends UnboxedReader<Long> {
Member Author

Removed only for the latest version of Spark, on the assumption that this code will go away when we deprecate the older versions. Spark 3.3 is deprecated. Should I handle Spark 3.4 as well?

@ajantha-bhat marked this pull request as ready for review on January 11, 2025 08:15
Member Author (@ajantha-bhat)

@rdblue: Thanks for the review. I have addressed the comments. Please take a look at it again.

@ajantha-bhat force-pushed the parquet_internal_writer branch from dae6c77 to 3eaf3bc on January 11, 2025 13:28