Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reordering avro nullable types in MergeHiveSchemaWithAvro #140

Open
wants to merge 1 commit into
base: li-0.11.x
Choose a base branch
from

Conversation

rzhang10
Copy link
Member

@rzhang10 rzhang10 commented Mar 1, 2023

This patch fixes a bug where the avro nullable type orders (the order of null and the other type's order) is wrongly flipped, resulting in read failure.

The fix is demonstrated by the new unit test shouldNotReorderListElementType

@github-actions github-actions bot added the CORE label Mar 1, 2023
@rzhang10 rzhang10 requested a review from shardulm94 March 1, 2023 02:31
@rzhang10 rzhang10 force-pushed the fix_avro_reordering_ branch from cb095b2 to d58b362 Compare March 1, 2023 02:33
@aastha25
Copy link

aastha25 commented Mar 1, 2023

LGTM, thanks for the PR @rzhang10.
Please link the JIRA ticket in the description, if available.

@@ -138,6 +138,15 @@ public static boolean isOptionSchema(Schema schema) {
return false;
}

public static int getNullIndex(Schema schema) {
Copy link
Contributor

@shardulm94 shardulm94 Mar 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nullIndexForOptionSchema

Iceberg does not use get in method names. nulls can also be present in complex schemas, so ensure that we add the OptionSchema constraint in the method name itself.

if (AvroSchemaUtil.isOptionSchema(schema)) {
int nullIndex = AvroSchemaUtil.getNullIndex(schema);
if (defaultValue != null && !defaultValue.equals(JsonProperties.NULL_VALUE)) {
return Schema.createUnion(schema.getTypes().get(1 - nullIndex), Schema.create(Type.NULL));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @rzhang10 , I am having a hard time reasoning about these changes. Can you please include a description of how these changes help fix the issue. Can you also update the Java doc of this method with the new changes?

Copy link
Contributor

@shardulm94 shardulm94 Mar 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I understand why this is needed. Previously reorderOptionIfRequired only worked correctly when schema was an option schema who first element was null. This was fine because we never returned an option with null as the second element from the children visitors, but now we do. I think the description needs to be updated to reflect the new behavior.

Comment on lines +147 to +154
int nullIndex = 0;
if (partner != null && AvroSchemaUtil.isOptionSchema(partner)) {
nullIndex = AvroSchemaUtil.getNullIndex(partner);
}
Schema hivePrimitive = hivePrimitiveToAvro(primitive);
// if there was no matching Avro primitive, use the Hive primitive
Schema result = partner == null ? hivePrimitive : checkCompatibilityAndPromote(hivePrimitive, partner);
return shouldResultBeOptional ? AvroSchemaUtil.toOption(result) : result;
return shouldResultBeOptional ? AvroSchemaUtil.toOption(result, nullIndex == 1) : result;
Copy link
Contributor

@shardulm94 shardulm94 Mar 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This handling is incomplete, it only handles cases for primitives inside containers (e.g. list), but does not handle similar cases for nested lists, maps, structs, etc. E.g. this case will fail

  @Test
  public void shouldNotReorderListElementType1() {
    String hive = "struct<fa:array<array<int>>>";
    Schema avro =
            SchemaBuilder.record("r1")
                    .fields()
                    .name("fa")
                    .type()
                    .nullable()
                    .array()
                    .items()
                    .nullable()
                    .array()
                    .items()
                    .nullable()
                    .intType()
                    .arrayDefault(Collections.singletonList(Arrays.asList(1, 2, 3)))
                    .endRecord();

    assertSchema(avro, merge(hive, avro));
  }

I think we need similar handling in other places where we use the shouldResultBeOptional pattern, e.g. in the struct, list and map methods. Might be good to refactor the common pattern into a separate method for reuse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants