SNOW-1507007 Support schema for new table format #814
Conversation
@@ -61,6 +61,8 @@
"org.bouncycastle:bcpkix-jdk18on": BOUNCY_CASTLE_LICENSE,
"org.bouncycastle:bcutil-jdk18on": BOUNCY_CASTLE_LICENSE,
"org.bouncycastle:bcprov-jdk18on": BOUNCY_CASTLE_LICENSE,
"org.roaringbitmap:RoaringBitmap": APACHE_LICENSE,
I can't find this library being used anywhere; why is this change needed?
It's a transitive dependency of org.apache.iceberg:iceberg-core:
[INFO] +- org.apache.iceberg:iceberg-core:jar:1.3.1:compile
[INFO] | +- org.apache.iceberg:iceberg-common:jar:1.3.1:runtime
[INFO] | +- org.apache.avro:avro:jar:1.11.1:runtime
[INFO] | +- org.apache.httpcomponents.client5:httpclient5:jar:5.2.1:runtime
[INFO] | | +- org.apache.httpcomponents.core5:httpcore5:jar:5.2:runtime
[INFO] | | \- org.apache.httpcomponents.core5:httpcore5-h2:jar:5.2:runtime
[INFO] | \- org.roaringbitmap:RoaringBitmap:jar:0.9.44:runtime
[INFO] | \- org.roaringbitmap:shims:jar:0.9.44:runtime
src/main/java/net/snowflake/ingest/streaming/internal/ColumnMetadata.java
src/main/java/net/snowflake/ingest/streaming/internal/OpenChannelRequestInternal.java
src/main/java/net/snowflake/ingest/streaming/internal/DataValidationUtil.java
  parquetTypes.add(typeInfo.getParquetType());
  this.metadata.putAll(typeInfo.getMetadata());
  int columnIndex = parquetTypes.size() - 1;
  fieldIndex.put(
      column.getInternalName(),
-     new ParquetColumn(column, columnIndex, typeInfo.getPrimitiveTypeName()));
+     new ParquetColumn(column, columnIndex, typeInfo.getParquetType()));
Where do we validate that only primitive-typed typeInfos are allowed for FDN tables?
The FDN-type-to-Parquet-type function generateColumnParquetTypeInfo only returns primitive types. Should we add an assertion here?
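A minimal sketch of the assertion being discussed. The names below (TypeSketch, assertPrimitive) are illustrative stand-ins, not the SDK's or Parquet's real API:

```java
public class PrimitiveTypeAssertionSketch {
    // Illustrative stand-in for a Parquet type: a name plus a
    // primitive/group flag. Not the SDK's real type class.
    static final class TypeSketch {
        final String name;
        final boolean primitive;
        TypeSketch(String name, boolean primitive) {
            this.name = name;
            this.primitive = primitive;
        }
    }

    // The assertion under discussion: fail fast if a non-primitive
    // type is generated for an FDN column.
    static void assertPrimitive(TypeSketch type, String columnName) {
        if (!type.primitive) {
            throw new IllegalStateException(
                "Expected primitive Parquet type for FDN column " + columnName
                    + " but got " + type.name);
        }
    }

    public static void main(String[] args) {
        assertPrimitive(new TypeSketch("INT64", true), "COL1"); // passes silently
        try {
            assertPrimitive(new TypeSketch("GROUP", false), "COL2");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```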
src/main/java/net/snowflake/ingest/streaming/internal/ParquetRowBuffer.java
src/main/java/net/snowflake/ingest/streaming/internal/ParquetTypeInfo.java
org.apache.iceberg.types.Type icebergDataType =
    deserializeIcebergType(column.getSourceIcebergDataType());
parquetType =
    typeToMessageType.primitive(icebergDataType.asPrimitiveType(), repetition, id, name);
We need error handling here for when a new Iceberg data type is returned by the service that the client doesn't know how to handle.
Added a check in IcebergDataTypeParser. If a new Iceberg data type is introduced on the server, Types.fromPrimitive will throw IllegalArgumentException; do you think this is sufficient?
src/main/java/net/snowflake/ingest/streaming/internal/ParquetTypeGenerator.java
 * GlobalServices/modules/data-lake/datalake-api/src/main/java/com/snowflake/metadata/iceberg
 * /IcebergDataTypeParser.java
 */
public class IcebergDataTypeParser {
public?
This class is in the utils package. It remains public since we need to use it from the internal package.
for next PR: this will allow customers to directly instantiate this class because it's public; let's discuss.
  JsonNode json = MAPPER.readTree(icebergDataType);
  return getTypeFromJson(json);
} catch (IOException e) {
  throw new SFException(ErrorCode.INTERNAL_ERROR, "Failed to deserialize Iceberg data type", e);
I see that in the FDN case we throw UNKNOWN_DATA_TYPE?
This exception is thrown on a JSON parse error; IMO this is an internal error, as it should not happen if the server side is passing a valid JSON string.
src/main/java/net/snowflake/ingest/utils/IcebergDataTypeParser.java
  }
}

throw new SFException(ErrorCode.INTERNAL_ERROR, "Cannot parse Iceberg type from: " + jsonNode);
- Instead of reserializing jsonNode, I'd pass in the original string and write that out to logs / the error message.
- (P2) It might be a good idea to only throw SFExceptions from "higher" layers and not from utility parsing classes like this one; IllegalArgumentException is probably good enough here. Let's chat with Toby on whether there's already a defined scheme for which levels can/should throw SFException.
Changed all exceptions to IllegalArgumentException in this class.
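A rough sketch of the agreed convention: utility-level parsing throws IllegalArgumentException for unknown type names. The mapping and method name below are illustrative only; the real parser walks a Jackson JsonNode and handles many more types:

```java
public class IcebergTypeNameSketch {
    // Hypothetical mapping of a few Iceberg primitive type names to
    // Parquet physical types, for illustration only.
    static String toParquetPhysicalType(String icebergType) {
        switch (icebergType) {
            case "int": return "INT32";
            case "long": return "INT64";
            case "double": return "DOUBLE";
            case "string": return "BINARY";
            default:
                // Per the review: a utility class throws
                // IllegalArgumentException rather than SFException.
                throw new IllegalArgumentException(
                    "Cannot parse Iceberg type from: " + icebergType);
        }
    }

    public static void main(String[] args) {
        System.out.println(toParquetPhysicalType("long")); // INT64
        try {
            toParquetPhysicalType("geometry"); // unknown -> throws
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```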
@sfc-gh-azagrebin @sfc-gh-lsembera could you help with reviewing Parquet writer and data type related changes in this PR? Thanks!
 */
static long validateAndParseIcebergLong(String columnName, Object input, long insertRowIndex) {
  if (input instanceof Number) {
    double value = ((Number) input).doubleValue();
This is a narrowing conversion, which loses precision, see:
long l1 = 499999999000000001L;
double d = l1;
long l2 = (long) d;
System.out.println(l1);
System.out.println(d);
System.out.println(l2);
yields:
499999999000000001
4.99999999E17
499999999000000000
It is better to use BigDecimal.
Switched to BigDecimal.
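A hedged sketch of the BigDecimal approach; the method name and rounding behavior here are assumptions for illustration, and the SDK's actual validator may differ:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class LongValidationSketch {
    // Illustrative only: convert a Number to long exactly via BigDecimal,
    // avoiding the double narrowing loss shown in the review comment.
    // HALF_UP matches the 1.499f -> 1L expectation in the tests below;
    // longValueExact() throws ArithmeticException on overflow.
    static long toLongExact(Number input) {
        return new BigDecimal(input.toString())
            .setScale(0, RoundingMode.HALF_UP)
            .longValueExact();
    }

    public static void main(String[] args) {
        long l1 = 499999999000000001L;
        // The double path would yield 499999999000000000; this is exact.
        System.out.println(toLongExact(l1) == l1); // true
        System.out.println(toLongExact(1.499f));   // 1
    }
}
```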
src/main/java/net/snowflake/ingest/streaming/internal/DataValidationUtil.java
 * @param insertRowIndex Row index for error reporting
 * @return Parsed integer
 */
static int validateAndParseIcebergInt(String columnName, Object input, long insertRowIndex) {
Should we add unit tests for these functions, which focus on corner cases? Min and max values, Double/Float NaN+positive/negative infinity, big integers and big decimals outside of the allowed range, etc.?
Added tests here.
I reviewed the int/long validation in DataValidationUtil and it LGTM; I left two small comments for your consideration.
For BDEC ingestion, we have a suite of integration tests testing data types end to end, i.e. including server-side scanning. I will leave it up to you whether you need the same for Parquet ingestion.
public void testValidateAndParseIcebergLong() {
  assertEquals(1L, validateAndParseIcebergLong("COL", 1, 0));
  assertEquals(1L, validateAndParseIcebergLong("COL", 1L, 0));
  assertEquals(1L, validateAndParseIcebergLong("COL", 1.499f, 0));
Let's add a test for negative zero, as well. I believe it should be converted to positive zero.
cc @sfc-gh-azagrebin who was dealing with this issue.
Added a test for -.0f. ITs for all Iceberg types are expected to be added when server-side Iceberg file registration is ready; backlogged to Jira.
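To illustrate why negative zero deserves a test (plain Java, independent of the SDK's validator):

```java
public class NegativeZeroSketch {
    public static void main(String[] args) {
        // -0.0f equals 0.0f under ==, but the bit patterns differ,
        // so a naive pass-through would silently persist the sign bit.
        System.out.println(-0.0f == 0.0f); // true
        System.out.println(
            Float.floatToIntBits(-0.0f) == Float.floatToIntBits(0.0f)); // false

        // Converting to an integral type normalizes to positive zero,
        // which is the behavior the requested test should pin down.
        long asLong = (long) -0.0f;
        System.out.println(asLong); // 0
    }
}
```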
src/main/java/net/snowflake/ingest/streaming/internal/DataValidationUtil.java
src/main/java/net/snowflake/ingest/streaming/internal/OpenChannelRequestInternal.java
if (isIcebergMode
    && response.getTableColumns().stream()
        .anyMatch(c -> c.getSourceIcebergDataType() == null)) {
  throw new SFException(
for next PR: also log this out before throwing, and log the request id too.
default:
  if (column.getSourceIcebergDataType() != null) {
    parquetType =
        IcebergDataTypeParser.parseIcebergDataTypeStringToParquetType(
for next PR: add testcase and throw proper error if an unknown data type is encountered
src/main/java/net/snowflake/ingest/streaming/internal/ParquetRowBuffer.java
@@ -261,6 +263,10 @@ Object getVectorValueAt(String column, int index) {
if (logicalType == ColumnLogicalType.BINARY && value != null) {
  value = value instanceof String ? ((String) value).getBytes(StandardCharsets.UTF_8) : value;
}
/* Mismatch between Iceberg string & FDN String */
if (Objects.equals(columnMetadata.getSourceIcebergDataType(), "\"string\"")) {
  value = value instanceof byte[] ? new String((byte[]) value, StandardCharsets.UTF_8) : value;
This is test-only code so no problem here, but otherwise assuming the byte[] is UTF-8 is a bit risky and shouldn't be done.
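To illustrate the risk being flagged (plain Java, not SDK code): decoding non-UTF-8 bytes as UTF-8 does not fail, it silently substitutes replacement characters.

```java
import java.nio.charset.StandardCharsets;

public class Utf8AssumptionSketch {
    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1 uses the single byte 0xE9 for 'é',
        // which is not a valid UTF-8 sequence.
        byte[] latin1 = "café".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding as UTF-8 does not throw; the invalid byte silently
        // becomes U+FFFD (the replacement character), corrupting the data.
        String decoded = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("café"));     // false
        System.out.println(decoded.contains("\uFFFD")); // true
    }
}
```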
    Types.primitive(PrimitiveType.PrimitiveTypeName.DOUBLE, repetition).id(id).named(name);
    break;
default:
  if (column.getSourceIcebergDataType() != null) {
for next PR: add validation that we only ever see column.sourceIcebergDataType as non-null when isIcebergMode is true. I saw you're already checking if this field is null on any column when we're in iceberg mode, need similar check for non-iceberg-mode.
Reviewed all files except the following (verified they're properly protected behind isIceberg flag, will get to them again later this week), signing off to unblock checkin.
IcebergParquetValueParser.java
IcebergDataTypeParser.java
DataValidationUtil.java
testcases
dataTypesToTest.add(new DataTypeInfo("\"date\"", Types.DateType.get()));
dataTypesToTest.add(new DataTypeInfo("\"time\"", Types.TimeType.get()));
dataTypesToTest.add(new DataTypeInfo("\"timestamptz\"", Types.TimestampType.withZone()));
dataTypesToTest.add(
The structured data type testing needs to cover more cases, and should also account for nested schema parsing.
Added to jira.
  if (val != null) {
-   String fieldName = cols.get(i).getPath()[0];
+   String fieldName = cols.get(i).getName();
What's the difference between getPath()[0] and getName()?
getPath()[0] always returns the root column name, while getName() returns the current column/subcolumn name. This doesn't matter for primitive data type columns, but it is needed for structured data types.
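The distinction can be sketched without the Parquet API; the arrays below stand in for what a column's path contains (the described behavior of getPath()/getName() is taken from the reply above, and the sketch itself is illustrative):

```java
public class ColumnPathSketch {
    public static void main(String[] args) {
        // For a flat primitive column, root and leaf coincide,
        // so getPath()[0] and getName() would agree.
        String[] flat = {"AGE"};
        System.out.println(flat[0]);                   // AGE (root)
        System.out.println(flat[flat.length - 1]);     // AGE (leaf)

        // For a structured column such as person.address.city they differ:
        // the root is the top-level field, the leaf is the subcolumn.
        String[] nested = {"person", "address", "city"};
        System.out.println(nested[0]);                 // person
        System.out.println(nested[nested.length - 1]); // city
    }
}
```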
Today we use only FDN-specific logical and physical data types. In this PR we change to use Iceberg's data types, so there is no loss of signal between the table schema on the server and the data type conversions done in the client.