Structured data type support #798

Closed
wants to merge 7 commits into from

Conversation

sfc-gh-alhuang
Contributor

@sfc-gh-alhuang commented Jul 25, 2024

Support structured data types for streaming to Iceberg. Refer here for all supported data types. Tests are still to be added.
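For illustration, a minimal sketch of what a row with structured values might look like on the client side, assuming nested values are supplied as plain java.util.Map / java.util.List through the existing insertRow(Map<String, Object>, String) channel API; the column names and types below are made up:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import net.snowflake.ingest.streaming.SnowflakeStreamingIngestChannel;

public class StructuredInsertExample {
  // Hypothetical example: inserts one row into a channel whose Iceberg table has
  // a struct column PERSON and a list column TAGS.
  static void insertNestedRow(SnowflakeStreamingIngestChannel channel) {
    Map<String, Object> person = new HashMap<>();
    person.put("name", "alice");
    person.put("age", 30);

    Map<String, Object> row = new HashMap<>();
    row.put("PERSON", person);                     // struct value as a Map
    row.put("TAGS", Arrays.asList("a", "b", "c")); // list value as a java.util.List

    channel.insertRow(row, /* offsetToken */ "1");
  }
}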

defaultTimezone,
insertRowsCurrIndex)
: IcebergParquetValueParser.parseColumnValueToParquet(
value, parquetColumn.type, forkedStats, defaultTimezone, insertRowsCurrIndex));
Collaborator

no need to pass in columnMetadata / columnMetadata.sourceIcebergDataType?

Contributor Author

I didn't use the column metadata because parquetColumn.type already includes all the information we need, including precision, scale, length, etc.
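For illustration, a small sketch showing that precision, scale, and fixed length are recoverable from the Parquet Type alone, using the standard org.apache.parquet.schema APIs; the class and method names here are illustrative, not the SDK's code:

import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.DecimalLogicalTypeAnnotation;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.Type;

final class ParquetTypeIntrospection {
  // Shows that precision/scale (for decimals) and the declared length
  // (for FIXED_LEN_BYTE_ARRAY) live on the Parquet Type itself,
  // without consulting ColumnMetadata.
  static String describe(Type type) {
    PrimitiveType primitive = type.asPrimitiveType();
    LogicalTypeAnnotation annotation = primitive.getLogicalTypeAnnotation();
    if (annotation instanceof DecimalLogicalTypeAnnotation) {
      DecimalLogicalTypeAnnotation decimal = (DecimalLogicalTypeAnnotation) annotation;
      return "decimal(" + decimal.getPrecision() + ", " + decimal.getScale() + ")";
    }
    return primitive.getPrimitiveTypeName() + "(length=" + primitive.getTypeLength() + ")";
  }
}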

if (val != null) {
String fieldName = cols.get(i).getPath()[0];
String fieldName = cols.get(i).getName();
Collaborator

while this may be functionally equivalent, I'd not mix this into your current PR. Feel free to send a separate PR to replace getPath[0] with getName.

Contributor Author

The .getPath()[0] was used against ColumnDescriptor, while .getName() is used on the Parquet Type, since we now need the information carried by the Type object.
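A tiny sketch of that distinction, using the standard Parquet APIs (illustrative only):

import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.schema.Type;

final class FieldNameLookup {
  // The old code resolved the top-level field name from a ColumnDescriptor,
  // whereas the new code works on a (possibly nested) Parquet Type.
  static String fromDescriptor(ColumnDescriptor descriptor) {
    return descriptor.getPath()[0]; // first element of the dotted column path
  }

  static String fromType(Type type) {
    return type.getName(); // the field's own name within its parent group
  }
}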

for (Object o : values) {
recordConsumer.startGroup();
if (o != null) {
write((List<Object>) o, cols.get(i).asGroupType());
Collaborator

should validate recursion level and fail beyond a threshold?
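A minimal sketch of the kind of guard being suggested; the threshold, exception type, and helper name are assumptions, not the SDK's actual behavior:

final class NestedDepthGuard {
  // Hypothetical guard: cap how deeply nested groups may recurse before failing
  // fast instead of risking a stack overflow. Would be called with the depth
  // incremented on each recursive write(...) call.
  private static final int MAX_NESTED_DEPTH = 100; // assumed threshold

  static void checkDepth(int currentDepth, String columnPath) {
    if (currentDepth > MAX_NESTED_DEPTH) {
      throw new IllegalArgumentException(
          "Column " + columnPath + " is nested more than " + MAX_NESTED_DEPTH + " levels deep");
    }
  }
}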

long insertRowsCurrIndex) {
Utils.assertNotNull("Parquet column stats", stats);
float estimatedParquetSize = 0F;
estimatedParquetSize += ParquetBufferValue.DEFINITION_LEVEL_ENCODING_BYTE_LEN;
Collaborator

  1. check out the comment on this constant; we are now going to have repetition levels, so does that change the size estimation logic here?
  2. Note that FDN tables are ALSO adding support for structured data types; arguably we'll have to add support in this SDK for them too. Our changes need to treat "Iceberg data types" and "structured data type support" as separate but related concepts and not tie them up too tightly. That is, keep the door open for FDN data types + structured data type nesting.

Contributor Author

Added repetition level encoding size estimation.
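For illustration, a rough sketch of the per-value estimate with a repetition-level term added; the constant name and the values used here are assumptions based on the reply above, not the SDK's actual numbers:

final class ParquetSizeEstimate {
  // Assumed per-value overhead constants (the diff only shows the definition-level one).
  static final float DEFINITION_LEVEL_ENCODING_BYTE_LEN = 1F;
  static final float REPETITION_LEVEL_ENCODING_BYTE_LEN = 1F;

  // Rough per-value estimate: payload bytes plus one definition level, plus one
  // repetition level when the column sits inside a repeated (list/map) group.
  static float estimate(float payloadBytes, boolean insideRepeatedGroup) {
    float estimated = payloadBytes + DEFINITION_LEVEL_ENCODING_BYTE_LEN;
    if (insideRepeatedGroup) {
      estimated += REPETITION_LEVEL_ENCODING_BYTE_LEN;
    }
    return estimated;
  }
}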

}

if (value == null) {
if (!type.isRepetition(Repetition.REQUIRED)) {
Collaborator

what about primitive types that are non-nullable, where the value is still null?

Contributor Author

The logic was incorrect, fixed.
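A minimal sketch of the corrected check, assuming the writer rejects nulls only for REQUIRED Parquet fields; the helper name and exception type are illustrative, not the SDK's:

import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Type.Repetition;

final class NullCheck {
  // A null value is only acceptable when the Parquet field is not REQUIRED;
  // otherwise fail for the offending row.
  static void validateNull(Object value, Type type, long insertRowsCurrIndex) {
    if (value == null && type.isRepetition(Repetition.REQUIRED)) {
      throw new IllegalArgumentException(
          "Column "
              + type.getName()
              + " is non-nullable but row "
              + insertRowsCurrIndex
              + " contains a null value");
    }
  }
}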

@sfc-gh-alhuang force-pushed the alhuang-structured-datatype branch from c7452e0 to f020b1f on July 25, 2024 20:49
type.getName(), value, insertRowsCurrIndex, Integer.class);
}
if (logicalTypeAnnotation instanceof DecimalLogicalTypeAnnotation) {
return getDecimalValue(value, type, insertRowsCurrIndex).unscaledValue().intValue();
Collaborator

why not use columnMetadata.scale ?

Collaborator

also, columnMetadata.precision isn't needed here?

Collaborator

yep, checked the documentation; we do support specifying custom precision and scale for managed Iceberg tables' decimal fields. See here: https://docs.snowflake.com/en/user-guide/tables-iceberg-data-types

Collaborator

also applies to long decimal..

Contributor Author
@sfc-gh-alhuang Jul 25, 2024

The purpose of this function is to parse a Java value into Parquet bytes. The actual bytes of a Parquet decimal column are an unscaled integer; the scanner later uses the precision and scale from the schema to infer the correct value.
The columnMetadata isn't needed because the Parquet Type already includes scale and precision; they are used here for decimal range validation.

String columnName, Object input, long insertRowIndex, Class<T> targetType) {
if (input instanceof Number) {
if (targetType.equals(Integer.class)) {
return targetType.cast(((Number) input).intValue());
Collaborator

Can you check what's the difference between type.class.cast() and Number.intValue / Number.longValue / Number.floatValue / Number.doubleValue ?

Contributor Author

The type.class.cast is used for the return-type check and is no longer needed after we split it into 4 different methods. The Number.xValue methods do a narrowing conversion if the source value cannot fit into the target type.
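A quick, self-contained demo of that difference (values chosen arbitrarily):

final class NarrowingDemo {
  public static void main(String[] args) {
    Number big = 3_000_000_000L; // does not fit into an int

    // Number.intValue() silently narrows: prints -1294967296.
    System.out.println(big.intValue());

    // Integer.class.cast(big) does not convert at all: a boxed Long is not an
    // Integer, so this throws ClassCastException.
    try {
      Integer unused = Integer.class.cast(big);
    } catch (ClassCastException e) {
      System.out.println("cast failed: " + e.getMessage());
    }
  }
}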

type.getName(), value, insertRowsCurrIndex, Long.class);
}
if (logicalTypeAnnotation instanceof DecimalLogicalTypeAnnotation) {
return getDecimalValue(value, type, insertRowsCurrIndex).unscaledValue().longValue();
Collaborator

.longValue() here and .intValue() in the previous method are guaranteed to not throw a casting exception?

Contributor Author

According to the docs here, longValue and intValue return only the last 8 and 4 bytes respectively, so I don't think an exception will be thrown here.
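A small demo of that behavior on BigInteger, which is what unscaledValue() returns:

import java.math.BigInteger;

final class BigIntegerTruncationDemo {
  // BigInteger.intValue()/longValue() are narrowing conversions: they keep only
  // the low-order 32/64 bits and never throw.
  public static void main(String[] args) {
    BigInteger big = BigInteger.ONE.shiftLeft(40).add(BigInteger.valueOf(7)); // 2^40 + 7
    System.out.println(big.longValue()); // 1099511627783 (fits in a long)
    System.out.println(big.intValue());  // 7 (only the low 32 bits survive)
  }
}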

defaultTimezone,
!includeTimeZone,
insertRowsCurrIndex)
.toBinary(false)
Collaborator

we are removing timezone in all cases, is that correct behavior?

Contributor Author
@sfc-gh-alhuang Jul 26, 2024

Referencing the logic here: Iceberg.timestamptz is converted to Snowflake.timestampltz (ref).

if (logicalTypeAnnotation == null) {
byte[] bytes =
DataValidationUtil.validateAndParseBinary(
type.getName(), value, Optional.empty(), insertRowsCurrIndex);
Collaborator

why not validate against maxLength here?

Contributor Author
@sfc-gh-alhuang Jul 26, 2024

Added max length validation. We don't need the max length from column metadata because the max lengths of Iceberg.binary (Snowflake.binary, 8 MB) and Iceberg.string (Snowflake.varchar, 16 MB) are constants.
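For illustration, a sketch of a constant-based length check; the constant names and the idea of hard-coding the limits here are assumptions drawn from the reply above, not the SDK's actual code:

final class IcebergLengthLimits {
  // Assumed constants for the limits mentioned above: Iceberg binary maps to
  // Snowflake BINARY (8 MB) and Iceberg string to Snowflake VARCHAR (16 MB),
  // so no per-column max length from ColumnMetadata is required.
  static final int MAX_BINARY_LENGTH_BYTES = 8 * 1024 * 1024;
  static final int MAX_STRING_LENGTH_BYTES = 16 * 1024 * 1024;

  static void checkBinaryLength(String columnName, byte[] bytes, long insertRowsCurrIndex) {
    if (bytes.length > MAX_BINARY_LENGTH_BYTES) {
      throw new IllegalArgumentException(
          "Binary value for column "
              + columnName
              + " at row "
              + insertRowsCurrIndex
              + " exceeds "
              + MAX_BINARY_LENGTH_BYTES
              + " bytes");
    }
  }
}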

for (int i = 0; i < type.getFieldCount(); i++) {
ParquetBufferValue parsedValue =
parseColumnValueToParquet(
structVal.getOrDefault(type.getFieldName(i), null),
Collaborator

Is null a properly handled value for all types of fields..?

Contributor Author

Afaik, the leaf primitive type will always check whether it's nullable. Will add a test to verify this.

}
if (logicalTypeAnnotation instanceof TimestampLogicalTypeAnnotation) {
boolean includeTimeZone =
((TimestampLogicalTypeAnnotation) logicalTypeAnnotation).isAdjustedToUTC();
Collaborator

I see timestamp and timestamptz as two separate types in this document - are we handling both?
https://docs.snowflake.com/en/user-guide/tables-iceberg-data-types

Contributor Author

Yes, Iceberg.timestamp and Iceberg.timestamptz use the same LogicalTypeAnnotation with different isAdjustedToUTC values. Ref
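A minimal demo of that point using the standard Parquet API: both flavors produce a TimestampLogicalTypeAnnotation and differ only in isAdjustedToUTC():

import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit;
import org.apache.parquet.schema.LogicalTypeAnnotation.TimestampLogicalTypeAnnotation;

final class TimestampAnnotationDemo {
  public static void main(String[] args) {
    TimestampLogicalTypeAnnotation tz = LogicalTypeAnnotation.timestampType(true, TimeUnit.MICROS);
    TimestampLogicalTypeAnnotation ntz = LogicalTypeAnnotation.timestampType(false, TimeUnit.MICROS);
    System.out.println(tz.isAdjustedToUTC());  // true  -> timestamptz
    System.out.println(ntz.isAdjustedToUTC()); // false -> timestamp (no time zone)
  }
}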

parseColumnValueToParquet(
structVal.getOrDefault(type.getFieldName(i), null),
type.getType(i),
stats,
Collaborator

we need the nested object's "metadata" to also be passed in for stuff like precision/scale/byte[] maxLength/etc.
However, we won't have a ColumnMetadata object in that situation, which means we now need an abstraction over ColumnMetadata for use in the value parsers. (I'm leaving another comment soon that we should look to have one ParquetValueParser class, as there are too many similarities for them to be completely independent classes, especially since we need to support structured data types for FDN types too some day.)
