-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Do not merg] TMP #866
[Do not merg] TMP #866
Conversation
sb.append(fieldName.charAt(j)); | ||
} | ||
} | ||
String originalFieldName = sb.substring(0, sb.toString().lastIndexOf('_')); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need this for extra field validation.
@@ -154,7 +154,7 @@ public static Type getTypeFromJson(@Nonnull JsonNode jsonNode) { | |||
field.isObject(), "Cannot parse struct field from non-object: %s", field); | |||
|
|||
int id = JsonUtil.getInt(ID, field); | |||
String name = JsonUtil.getString(NAME, field); | |||
String name = (JsonUtil.getString(NAME, field) + "_" + id).replace("_", "_x5F"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Encode all "_" to its hex, same idea as escape.
float estimatedParquetSize = 0f; | ||
for (int i = 0; i < type.getFieldCount(); i++) { | ||
StringBuilder sb = new StringBuilder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing this stringbuilder business on the hotpath will make performance even worse.
Have two suggestions:
Option 1:
- When parsing the schema, in IcebergDataTypeParser, set the struct's field's name to the field id.
- When validating the input row, in IcebergParquetValueParser, (over here next to this comment) : (i) switch the logic to loop over the structVal.Keys collection instead of type.Fields, (ii) for each key, do a lookup to get the field id (you'll have to pipe through a lookup map and maintain it somewhere too), (iii) use this fieldId as the "field name" when doing type.getType. Note that type.getType has an overload that takes in field names.
Option 2:
- When parsing the schema, in IcebergDataTypeParser, set the struct's field name to getEscapedString(fieldName) and let AvroSchemaUtil do its thing on whatever we pass in.
- Whan validating the input row, in IcebergParquetValueParser, (i) switch logic to loop over structVal.Keys instead of type.Fields, (ii) for each key, call type.getFieldName(getEscapedString(key))
- getEscapedString(String str) should be implemented as:
return str.replace("_", "_x" + Integer.toHexString("_").toUpperCase())
This combined with AvroSchemaUtil's internal behavior to escape every non-digit, non-letter, non-underscore character, will make sure we are able to generate safe non-colliding names. tldr with this fix we're making sure everything except digits and numbers is getting escaped properly, and effectively fixing AvroSchemaUtil's bug
In both cases, switching what we iterate on will need to be accompanied by fixes in how extraFields collection's bookkeeping is done.
I prefer option 1 because it completely sidesteps AvroSchemaUtil's buggy behavior. But i'm not sure if there is an easy way to maintain this per-struct-type map somewhere and handle all nested struct situations where there's a series of structs embedded inside each other. Still putting it out here in case you can come up with something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, will try option 1
No description provided.