SNOW-1787322 Fix InsertError for structured data type #888

sfc-gh-alhuang · 2024-11-05T00:05:37Z

Currently, when ingesting structured data types into Iceberg tables, InsertError does not populate the extra columns, missing columns, or null values in non-nullable columns reported in InsertValidationResponse. This PR fixes that.

Sub-columns are represented by their Parquet dot path, with any dot or backslash character in a field name escaped by a backslash. For example, column x in map_col(string, object("a.b" array(object("c\d" object(x int))))) has the path MAP_COL.key_value.value.a\.b.list.element.c\\d.x
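
A minimal sketch of the escaping rule described above, reproducing the example path. The class and method names (DotPathSketch, escapeFieldName, toDotPath) are hypothetical and exist only for illustration; the replace chain mirrors the one-line change to Utils.java in this PR, but the rest is an assumption about how the components are joined, not the SDK's actual code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative only: hypothetical helper, not part of the ingest SDK. */
final class DotPathSketch {
  private DotPathSketch() {}

  /** Escapes backslashes first, then dots, so an escaped dot is never re-escaped. */
  static String escapeFieldName(String name) {
    return name.replace("\\", "\\\\").replace(".", "\\.");
  }

  /** Escapes each path component and joins the results with an unescaped dot. */
  static String toDotPath(List<String> components) {
    return components.stream()
        .map(DotPathSketch::escapeFieldName)
        .collect(Collectors.joining("."));
  }

  public static void main(String[] args) {
    // Components of column x from the example in the PR description.
    List<String> path =
        Arrays.asList("MAP_COL", "key_value", "value", "a.b", "list", "element", "c\\d", "x");
    // Prints: MAP_COL.key_value.value.a\.b.list.element.c\\d.x
    System.out.println(toDotPath(path));
  }
}
```

The order of the two replace calls matters: escaping dots first would introduce backslashes that the second call would then double.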

sfc-gh-hmadan · 2024-11-05T06:59:16Z

src/main/java/net/snowflake/ingest/streaming/InsertValidationResponse.java

+     *
+     * @param missingNotNullColName the missing non-nullable column name
+     */
+    public void addMissingNotNullColName(String missingNotNullColName) {


any chance this can be called from multiple threads?

InsertError instances are independent between rows, so I think this should be safe.

src/main/java/net/snowflake/ingest/streaming/internal/AbstractRowBuffer.java

src/main/java/net/snowflake/ingest/streaming/internal/IcebergParquetValueParser.java

sfc-gh-hmadan · 2024-11-05T07:20:37Z

src/main/java/net/snowflake/ingest/streaming/internal/IcebergParquetValueParser.java

+                error);
+        listVal.add(parsedValue.getValue());
+        estimatedParquetSize += parsedValue.getSize();
+      } else {


this else block is (a) not incrementing estimatedParquetSize, and (b) not adding to listVal in one branch (when required=true). Unclear why you really need this else block?
You can also do the following I think and avoid the unnecessary if-else branching?

    String fieldName = type.getFieldName(i);
    Object val = structVal.getOrDefault(fieldName, null);
    ParquetBufferValue parsedValue = parseColumnValueToParquet(val, ....);
    listVal.add(parsedValue.getValue());
    estParquetSize = ...;
    if (type.getType(i).isRepetition(REQUIRED)) {
      missingFields.add(fieldName);
    }

The else block is there to distinguish a missing column from a column with a null value. We cannot use getOrDefault(fieldName, null), because it erases the difference between a null value and a missing key.

ah, got it. please add a comment around this.
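
For readers following the thread, a minimal sketch of the distinction discussed above: containsKey separates an absent key from a key that is present with a null value, while getOrDefault(fieldName, null) returns null in both cases. The class, the Findings holder, and the check method are hypothetical names for illustration and are not the parser's actual code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical names, not the IcebergParquetValueParser logic. */
final class MissingVersusNullSketch {
  private MissingVersusNullSketch() {}

  /** Hypothetical holder for the two kinds of problems found in one struct value. */
  static final class Findings {
    final List<String> missingNotNull = new ArrayList<>();
    final List<String> nullInNotNull = new ArrayList<>();
  }

  /** Classifies each required field as missing (key absent) or null (key present, value null). */
  static Findings check(Map<String, Object> structVal, List<String> requiredFields) {
    Findings findings = new Findings();
    for (String field : requiredFields) {
      if (!structVal.containsKey(field)) {
        // Key is absent: getOrDefault(field, null) would hide this case.
        findings.missingNotNull.add(field);
      } else if (structVal.get(field) == null) {
        // Key is present but explicitly mapped to null.
        findings.nullInNotNull.add(field);
      }
    }
    return findings;
  }

  public static void main(String[] args) {
    Map<String, Object> row = new HashMap<>();
    row.put("k1", null); // present, null value
    row.put("k3", 1); // present, non-null value; k2 is never added
    Findings f = check(row, List.of("k1", "k2", "k3"));
    System.out.println("missing=" + f.missingNotNull + " null=" + f.nullInNotNull);
  }
}
```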

src/main/java/net/snowflake/ingest/streaming/internal/IcebergParquetValueParser.java

src/main/java/net/snowflake/ingest/utils/SubColumnFinder.java

sfc-gh-hmadan · 2024-11-06T03:59:02Z

src/main/java/net/snowflake/ingest/utils/Utils.java

      }
      if (sb.length() > 0) {
        sb.append(".");
      }
-      sb.append(p);
+      sb.append(p.replace("\\", "\\\\").replace(".", "\\."));


does an empty string in p also need to be replaced by something? Is there an IT for an empty field name?

I don't think an empty string needs to be replaced. An empty field name does not collide with any other dot path. An IT with empty field names was included in a previous PR, in IcebergStructuredIT.testFieldName.
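
One way to check the no-collision claim above is to show that an escaped path can be split back into its components, empty names included. The sketch below is an assumption-laden illustration: DotPathSplitSketch, join, and split are hypothetical helpers, and join places separators by component index, which may differ from how the SDK's Utils method decides when to append a dot.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: hypothetical helpers, not part of the ingest SDK. */
final class DotPathSplitSketch {
  private DotPathSplitSketch() {}

  /** Same escaping rule as in this PR: backslash first, then dot. */
  static String escape(String name) {
    return name.replace("\\", "\\\\").replace(".", "\\.");
  }

  /** Joins escaped components with a dot between every pair of neighbours. */
  static String join(List<String> names) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < names.size(); i++) {
      if (i > 0) {
        sb.append('.');
      }
      sb.append(escape(names.get(i)));
    }
    return sb.toString();
  }

  /** Splits on unescaped dots and removes one level of backslash escaping. */
  static List<String> split(String path) {
    List<String> out = new ArrayList<>();
    StringBuilder cur = new StringBuilder();
    for (int i = 0; i < path.length(); i++) {
      char ch = path.charAt(i);
      if (ch == '\\' && i + 1 < path.length()) {
        cur.append(path.charAt(++i)); // escaped character is taken literally
      } else if (ch == '.') {
        out.add(cur.toString()); // unescaped dot ends the current component
        cur.setLength(0);
      } else {
        cur.append(ch);
      }
    }
    out.add(cur.toString());
    return out;
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList("a", "", "b.");
    String path = join(names);
    System.out.println(path + " -> " + split(path));
  }
}
```

Under these assumptions, ["a", "", "b"] joins to a..b while ["a.", "b"] joins to a\..b, so the empty component stays distinguishable from an escaped dot.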

sfc-gh-hmadan · 2024-11-06T04:17:15Z

src/test/java/net/snowflake/ingest/streaming/internal/IcebergParquetValueParserTest.java

@@ -85,7 +85,7 @@ public void parseValueInt() {
        };
    ParquetBufferValue pv =
        IcebergParquetValueParser.parseColumnValueToParquet(
-            Integer.MAX_VALUE, type, rowBufferStatsMap, mockSubColumnFinder, UTC, 0);
+            Integer.MAX_VALUE, type, rowBufferStatsMap, mockSubColumnFinder, UTC, 0, null);


thanks for not adding a method overload :)

sfc-gh-hmadan · 2024-11-06T04:22:37Z

src/test/java/net/snowflake/ingest/streaming/internal/datatypes/IcebergStructuredIT.java

+            "object(k1 int not null, k2 object(k3 int not null, k4 object(k5 int not null) not"
+                + " null) not null) not null");
+    SnowflakeStreamingIngestChannel channel =
+        openChannel(tableName, OpenChannelRequest.OnErrorOption.ABORT);


looks like a pre-existing gap where someone using OnErrorOption.ABORT will not get the same set of information (insertErrors object) as OnErrorOption.CONTINUE ? (unrelated to your PR)

For a single row, the errors reported should be the same regardless of the OnErrorOption.

sfc-gh-hmadan · 2024-11-06T04:33:40Z

src/test/java/net/snowflake/ingest/streaming/internal/datatypes/IcebergStructuredIT.java

+        channel.insertRow(createStreamingIngestRow(row), UUID.randomUUID().toString());
+    assertThat(insertValidationResponse.getInsertErrors().size()).isEqualTo(1);
+    assertThat(insertValidationResponse.getInsertErrors().get(0).getExtraColNames())
+        .containsOnly("VALUE.k2", "VALUE.k\\.3", "VALUE.k\\\\4");


FOR KC team: The first VALUE is the default column name used by the test harness, and doesn't have any semantics. This is unlike the key-value/list/element/value words that you'll see in the examples that follow.

sfc-gh-hmadan · 2024-11-06T04:39:08Z

Signing off, but please hold off on merging until we get a positive ack from Warsaw that this change is sufficient; the Slack conversation is ongoing.

sfc-gh-alhuang requested review from sfc-gh-tzhang and a team as code owners November 5, 2024 00:05

sfc-gh-alhuang requested a review from sfc-gh-hmadan November 5, 2024 00:05