[Do not merg] TMP #866

sfc-gh-alhuang · 2024-10-18T22:48:44Z

No description provided.

sfc-gh-alhuang · 2024-10-18T22:50:40Z

src/main/java/net/snowflake/ingest/streaming/internal/IcebergParquetValueParser.java

+          sb.append(fieldName.charAt(j));
+        }
+      }
+      String originalFieldName = sb.substring(0, sb.toString().lastIndexOf('_'));


We need this for extra field validation.

sfc-gh-alhuang · 2024-10-18T22:51:52Z

src/main/java/net/snowflake/ingest/utils/IcebergDataTypeParser.java

@@ -154,7 +154,7 @@ public static Type getTypeFromJson(@Nonnull JsonNode jsonNode) {
          field.isObject(), "Cannot parse struct field from non-object: %s", field);

      int id = JsonUtil.getInt(ID, field);
-      String name = JsonUtil.getString(NAME, field);
+      String name = (JsonUtil.getString(NAME, field) + "_" + id).replace("_", "_x5F");


Encode every "_" as its hex escape ("_x5F"); same idea as escaping.
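A minimal sketch of the round trip implied by the two diffs above, assuming the only escape involved is "_" to "_x5F" (any further rewriting by AvroSchemaUtil is ignored here). The class and method names are hypothetical and not part of the PR.

```java
// Hypothetical helper, for illustration only: mirrors the "name + _ + id"
// encoding from IcebergDataTypeParser and the suffix stripping from
// IcebergParquetValueParser shown in the diffs above.
final class FieldNameCodecSketch {
  private static final String ESCAPED_UNDERSCORE = "_x5F";

  // Appends the field id, then escapes every underscore (including the separator).
  static String encode(String name, int id) {
    return (name + "_" + id).replace("_", ESCAPED_UNDERSCORE);
  }

  // Reverses the underscore escape, then drops the trailing "_<id>" suffix.
  static String decode(String encoded) {
    String unescaped = encoded.replace(ESCAPED_UNDERSCORE, "_");
    int separator = unescaped.lastIndexOf('_');
    if (separator < 0) {
      throw new IllegalArgumentException("No id suffix in: " + encoded);
    }
    return unescaped.substring(0, separator);
  }

  public static void main(String[] args) {
    String encoded = encode("my_col", 3);
    System.out.println(encoded + " -> " + decode(encoded));
  }
}
```

Because the id consists of digits only, the last underscore after unescaping is always the separator, so original names that themselves contain underscores survive the round trip.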

sfc-gh-hmadan · 2024-10-21T02:15:16Z

src/main/java/net/snowflake/ingest/streaming/internal/IcebergParquetValueParser.java

    float estimatedParquetSize = 0f;
    for (int i = 0; i < type.getFieldCount(); i++) {
+      StringBuilder sb = new StringBuilder();


Doing this StringBuilder business on the hot path will make performance even worse.

Have two suggestions:

Option 1:

When parsing the schema, in IcebergDataTypeParser, set the struct's field's name to the field id.

When validating the input row, in IcebergParquetValueParser (right here, next to this comment): (i) switch the logic to loop over the structVal.Keys collection instead of type.Fields; (ii) for each key, do a lookup to get the field id (you'll have to pipe through a lookup map and maintain it somewhere too); (iii) use this fieldId as the "field name" when calling type.getType. Note that type.getType has an overload that takes a field name.
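A rough sketch of what Option 1's validation loop could look like, using plain Java collections in place of the Parquet types. Everything here (class name, `resolve`, the shape of the lookup map) is an assumption for illustration; it covers a single struct level only and does not address the nested-struct bookkeeping.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of Option 1: iterate over the input row's keys,
// map each key to its field id, and use the id as the lookup name.
final class FieldIdLookupSketch {
  // Values re-keyed by field id, plus any input keys missing from the schema.
  static final class Result {
    final Map<String, Object> valuesByIdName = new LinkedHashMap<>();
    final List<String> extraFields = new ArrayList<>();
  }

  static Result resolve(Map<String, Object> structVal, Map<String, Integer> nameToFieldId) {
    Result result = new Result();
    for (Map.Entry<String, Object> entry : structVal.entrySet()) {
      Integer fieldId = nameToFieldId.get(entry.getKey());
      if (fieldId == null) {
        // Key is not part of the struct's schema: record it as an extra field.
        result.extraFields.add(entry.getKey());
      } else {
        // The stringified id stands in for the "field name" used in the type lookup.
        result.valuesByIdName.put(String.valueOf(fieldId), entry.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("a_b", 1);
    row.put("unknown", 2);
    Map<String, Integer> ids = Map.of("a_b", 7);
    Result r = resolve(row, ids);
    System.out.println(r.valuesByIdName + " extra=" + r.extraFields);
  }
}
```

Since no string is built or parsed per field, the per-row cost is one map lookup per input key.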

Option 2:

When parsing the schema, in IcebergDataTypeParser, set the struct's field name to getEscapedString(fieldName) and let AvroSchemaUtil do its thing on whatever we pass in.

When validating the input row, in IcebergParquetValueParser: (i) switch the logic to loop over structVal.Keys instead of type.Fields; (ii) for each key, call type.getFieldName(getEscapedString(key)).

getEscapedString(String str) should be implemented as: return str.replace("_", "_x" + Integer.toHexString('_').toUpperCase())
Combined with AvroSchemaUtil's internal behavior of escaping every character that is not a digit, letter, or underscore, this makes sure we generate safe, non-colliding names. TL;DR: with this fix, everything except digits and letters gets escaped properly, effectively fixing AvroSchemaUtil's bug.

In both cases, switching what we iterate over will need to be accompanied by fixes to how the extraFields collection's bookkeeping is done.
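To illustrate why pre-escaping underscores avoids collisions under Option 2, here is a small sketch. `sanitizeStandIn` is a simplified, hypothetical stand-in for the escaping behavior described above (non-letter, non-digit, non-underscore characters become "_x" plus uppercase hex); it is not AvroSchemaUtil's actual code.

```java
// Hypothetical illustration of Option 2; names are not from the PR.
final class EscapeSketch {
  // Escapes every underscore as "_x5F" ('_' is code point 0x5F).
  static String getEscapedString(String str) {
    return str.replace("_", "_x" + Integer.toHexString('_').toUpperCase());
  }

  // Simplified stand-in (an assumption) for the downstream escaping:
  // letters, digits, and underscores pass through, everything else is hex-escaped.
  static String sanitizeStandIn(String name) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < name.length(); i++) {
      char c = name.charAt(i);
      if (Character.isLetterOrDigit(c) || c == '_') {
        sb.append(c);
      } else {
        sb.append("_x").append(Integer.toHexString(c).toUpperCase());
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Without pre-escaping, "a.b" and the literal name "a_x2Eb" map to the same string.
    System.out.println(sanitizeStandIn("a.b").equals(sanitizeStandIn("a_x2Eb")));
    // With pre-escaping, the two names stay distinct.
    System.out.println(sanitizeStandIn(getEscapedString("a.b")));
    System.out.println(sanitizeStandIn(getEscapedString("a_x2Eb")));
  }
}
```

The escaping is computed once per schema field at parse time; on the validation path it would run once per input key, which is the trade-off against Option 1's map lookup.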

I prefer option 1 because it completely sidesteps AvroSchemaUtil's buggy behavior. But I'm not sure if there is an easy way to maintain this per-struct-type map somewhere and handle all nested-struct situations, where a series of structs is embedded inside each other. Still putting it out here in case you can come up with something.

sfc-gh-alhuang · 2024-10-21

Sounds good, will try option 1.

sfc-gh-alhuang added 2 commits October 17, 2024 16:08

done

5916907

tmp

f61bec6

sfc-gh-alhuang commented Oct 18, 2024

sfc-gh-hmadan reviewed Oct 21, 2024

sfc-gh-alhuang closed this Oct 22, 2024

sfc-gh-alhuang deleted the alhuang/tmp branch October 22, 2024 17:03
