Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Do not merg] TMP #866

Closed
wants to merge 2 commits into from
Closed

[Do not merg] TMP #866

wants to merge 2 commits into from

Conversation

sfc-gh-alhuang
Copy link
Contributor

No description provided.

sb.append(fieldName.charAt(j));
}
}
String originalFieldName = sb.substring(0, sb.toString().lastIndexOf('_'));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this for extra field validation.

@@ -154,7 +154,7 @@ public static Type getTypeFromJson(@Nonnull JsonNode jsonNode) {
field.isObject(), "Cannot parse struct field from non-object: %s", field);

int id = JsonUtil.getInt(ID, field);
String name = JsonUtil.getString(NAME, field);
String name = (JsonUtil.getString(NAME, field) + "_" + id).replace("_", "_x5F");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Encode all "_" to its hex, same idea as escape.

float estimatedParquetSize = 0f;
for (int i = 0; i < type.getFieldCount(); i++) {
StringBuilder sb = new StringBuilder();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing this stringbuilder business on the hotpath will make performance even worse.

Have two suggestions:

Option 1:

  • When parsing the schema, in IcebergDataTypeParser, set the struct's field's name to the field id.
  • When validating the input row, in IcebergParquetValueParser, (over here next to this comment) : (i) switch the logic to loop over the structVal.Keys collection instead of type.Fields, (ii) for each key, do a lookup to get the field id (you'll have to pipe through a lookup map and maintain it somewhere too), (iii) use this fieldId as the "field name" when doing type.getType. Note that type.getType has an overload that takes in field names.

Option 2:

  • When parsing the schema, in IcebergDataTypeParser, set the struct's field name to getEscapedString(fieldName) and let AvroSchemaUtil do its thing on whatever we pass in.
  • Whan validating the input row, in IcebergParquetValueParser, (i) switch logic to loop over structVal.Keys instead of type.Fields, (ii) for each key, call type.getFieldName(getEscapedString(key))
  • getEscapedString(String str) should be implemented as: return str.replace("_", "_x" + Integer.toHexString("_").toUpperCase())
    This combined with AvroSchemaUtil's internal behavior to escape every non-digit, non-letter, non-underscore character, will make sure we are able to generate safe non-colliding names. tldr with this fix we're making sure everything except digits and numbers is getting escaped properly, and effectively fixing AvroSchemaUtil's bug
    In both cases, switching what we iterate on will need to be accompanied by fixes in how extraFields collection's bookkeeping is done.

I prefer option 1 because it completely sidesteps AvroSchemaUtil's buggy behavior. But i'm not sure if there is an easy way to maintain this per-struct-type map somewhere and handle all nested struct situations where there's a series of structs embedded inside each other. Still putting it out here in case you can come up with something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, will try option 1

@sfc-gh-alhuang sfc-gh-alhuang deleted the alhuang/tmp branch October 22, 2024 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants