GH-465: Clarify backward-compatibility rules on LIST type #466

Open
wants to merge 8 commits into
base: master

wgtmac
Member

@wgtmac wgtmac commented Oct 30, 2024

Rationale for this change

The parquet-cpp C++ reader has a hard time reading Parquet files written by parquet-java with parquet.avro.write-old-list-structure=true and the schema below:

optional group a (LIST) {
  repeated group array (LIST) {
    repeated int32 array;
  }
}

See apache/arrow#43994 and apache/arrow#43995
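
To make the problem concrete, here is a minimal Python sketch of how a reader might resolve the element type of the schema above. It is an illustration only: `Field`, `resolve_element`, and `describe` are hypothetical names, not part of parquet-cpp or parquet-java. The rule order follows the backward-compatibility rules quoted later in this thread, plus the "group with a single repeated field" rule suggested in review. The fallthrough branch reflects one reading of the "Otherwise" rule (standard 3-level encoding), which is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Field:
    """Hypothetical, minimal stand-in for a Parquet schema node."""
    name: str
    repetition: str                      # "required", "optional", or "repeated"
    physical_type: Optional[str] = None  # set for primitives, None for groups
    annotation: Optional[str] = None     # e.g. "LIST"
    children: List["Field"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        return self.physical_type is None


def resolve_element(list_group: Field) -> Tuple[Field, bool]:
    """Return (element field, elements_required) for a LIST-annotated group."""
    assert list_group.annotation == "LIST" and len(list_group.children) == 1
    repeated = list_group.children[0]
    assert repeated.repetition == "repeated"

    # Rule 1: the repeated field is not a group.
    if not repeated.is_group:
        return repeated, True
    # Rule 2: the repeated field is a group with multiple fields.
    if len(repeated.children) > 1:
        return repeated, True
    only_child = repeated.children[0]
    # Rule suggested in review: a group whose single field is repeated cannot
    # be the middle layer of a 3-level list, so the group itself is the element.
    if only_child.repetition == "repeated":
        return repeated, True
    # Rule 3: single-field group named `array` or `<list name>_tuple`.
    if repeated.name in ("array", list_group.name + "_tuple"):
        return repeated, True
    # Otherwise (assumed reading): standard 3-level encoding, where the single
    # child is the element and carries its own repetition.
    return only_child, only_child.repetition == "required"


def describe(node: Field) -> str:
    """Render the logical type of a node, recursing through LIST groups."""
    if node.annotation == "LIST":
        element, _ = resolve_element(node)
        return f"List<{describe(element)}>"
    if node.is_group:
        return "Struct<" + ", ".join(describe(c) for c in node.children) + ">"
    return node.physical_type


# The schema from the description: a LIST whose repeated field is itself a
# LIST-annotated group with one repeated primitive field.
schema = Field("a", "optional", annotation="LIST", children=[
    Field("array", "repeated", annotation="LIST", children=[
        Field("array", "repeated", physical_type="int32"),
    ]),
])
print(describe(schema))  # List<List<int32>>
```

Under these assumptions the outer repeated group is treated as the element (an inner 2-level list) rather than as the middle layer of a 3-level list.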

What changes are included in this PR?

Clarify the rules and add an example.

Do these changes have PoC implementations?

Not required

Closes #465

@wgtmac wgtmac force-pushed the old-list-structure branch 3 times, most recently from abc3d1d to d9579bc Compare October 30, 2024 03:23
@mapleFU
Member

mapleFU commented Oct 30, 2024

BTW, can you also point out the Java code in this PR?

@mapleFU
Member

mapleFU commented Oct 30, 2024

Otherwise LGTM, but I think it is worth opening a discussion...

@wgtmac
Member Author

wgtmac commented Oct 30, 2024

I have sent a discussion thread to the dev ML. It would be good if you can take a look. Thanks! @emkornfield @pitrou @gszadovszky @rdblue @etseidl @clairemcginty

Contributor

@etseidl etseidl left a comment

Seems like a nice clarification to me. One little (ignorable) nit.

I'm also wondering if we should add an example above for the non-LIST annotated repeating fields, or should that be a new PR?

LogicalTypes.md Outdated
repeated type is the element type and elements are required.
element type and elements are required. In this case, the element type is
a Struct type with multiple fields.
3. If the repeated field is a group (without annotation) with one `required` or
Contributor

I might leave the entire block as it was previously and clarify that:

  1. Without annotation the type is a single struct/tuple with the element in it.
  2. With a logical type annotation (I think only LIST and MAP apply here).

I also think we need to update the documentation above:

The outer-most level must be a group annotated with `LIST` that contains a
  single field named `list`. The repetition of this level must be either
  `optional` or `required` and determines whether the list is nullable.

To point to the fact that for backwards compatibility it can also be repeated.

Member Author

@wgtmac wgtmac Oct 31, 2024

To point to the fact that for backwards compatibility it can also be repeated.

The original statement is all about the formal three-level encoding. Adding repeated to it may further confuse readers. That's why I tried to clarify this in the backward compatibility section.

LogicalTypes.md Outdated
Some existing data does not include the inner element layer. For
backward-compatibility, the type of elements in `LIST`-annotated structures
should always be determined by the following rules:
Some existing data does not include the inner element layer, meaning that
Member Author

I have added more explanation according to the ML discussion. I'm not a native speaker, please help me check if these are suitable. Thanks! @rdblue @gszadovszky

Member

@mapleFU mapleFU left a comment

Thanks for clarifying! This is so tricky especially rule 3 and rule 4 in 2-level lists...

For backward-compatibility, the type of elements in `LIST`-annotated 2-level
structures should always be determined by the following rules:

1. If the repeated field is not a group, then its type is the element type and
elements are required.
Member

An unrelated note: I find the "elements are required" wording a bit tricky...

field's repetition.
2. If the repeated field is a group with multiple fields, then its type (Struct
type with multiple fields) is the element type and elements are required.
3. If the repeated field is a group with one `required` or `optional` field,
Member

with only one?

optional group my_list (LIST) {
  repeated group my_list_tuple {
    required binary str (STRING);
  };
}

Member

Do we have a sample for rule 5?

Member Author

As I have explained in https://lists.apache.org/thread/s6b25j3x26009v054yqjov0f1z49ctqj, the only case left for rule 5 is as below:

optional group foo (LIST) {
  repeated group bar {
    repeated TYPE baz;
  };
}

It should be resolved to List<Struct<List<TYPE>>> if the 1-level structure is allowed; otherwise it is an invalid case altogether.

and is named either `array` or uses the `LIST`-annotated group's name with
`_tuple` appended, then the repeated type (Struct type with single field) is
the element type and elements are required.
4. If the repeated field is a `LIST`-annotated group with one `repeated` field,
Contributor

At one point I thought you had rule 3 specify that the repeated group is unannotated. Without that, should rules 3 and 4 be swapped?

Member Author

@wgtmac wgtmac Nov 4, 2024

No, rule 3 is for a group with one required or optional field, while rule 4 is for a LIST-annotated group with one repeated field. They do not overlap, and rule 3 implies that the group cannot have any annotation, so their order does not matter.

Contributor

Got it, thanks.
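
The non-overlap argument in this thread can be checked mechanically. The sketch below is a hedged illustration: the predicates `matches_rule_3` and `matches_rule_4` are hypothetical helpers that encode rules 3 and 4 as worded in this revision of the PR, for a single-field repeated group inside a LIST-annotated group assumed to be named `my_list`. It enumerates a small space of names, annotations, and child repetitions and looks for any combination that satisfies both rules.

```python
from itertools import product

REPETITIONS = ("required", "optional", "repeated")
ANNOTATIONS = (None, "LIST")
NAMES = ("array", "my_list_tuple", "element")


def matches_rule_3(name, annotation, child_repetition):
    """Rule 3 as quoted: one `required`/`optional` field and a special name."""
    return (child_repetition in ("required", "optional")
            and name in ("array", "my_list_tuple"))


def matches_rule_4(name, annotation, child_repetition):
    """Rule 4 as quoted: LIST-annotated group with one `repeated` field."""
    return annotation == "LIST" and child_repetition == "repeated"


# Collect every combination that both rules would claim.
overlap = [combo for combo in product(NAMES, ANNOTATIONS, REPETITIONS)
           if matches_rule_3(*combo) and matches_rule_4(*combo)]
print(overlap)  # []
```

Note that `matches_rule_3` deliberately ignores the annotation: the two rules are already disjoint because they require different repetitions on the single child field.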

LogicalTypes.md Outdated
Comment on lines 724 to 728
##### 1-level structure without `LIST` annotation

Some existing data does not even have the `LIST` annotation and simply uses
`repeated` repetition to annotate the element type. In this case, the element
type MUST be a primitive type and both the list and elements are required.
Contributor

I don't know if this belongs in the backwards compatibility section. AFAICT using repeated without the LIST annotation is still supported by the spec (and some writers can still produce it, including arrow-rs; cc @zeevm).

Also, I don't think the current wording requires a primitive type. The following should be allowed (List<Struct<Integer, String>> non-null list, non-null elements):

repeated group list_struct {
  required int32 id;
  optional binary val (STRING);
}

In fact, in parquet-testing there's a file repeated_no_annotation.parquet with an unannotated repeated group:

message user {
  REQUIRED INT32 id;
  OPTIONAL group phoneNumbers {
    REPEATED group phone {
      REQUIRED INT64 number;
      OPTIONAL BYTE_ARRAY kind (UTF8);
    }
  }
}
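
As a rough illustration of the interpretation described in this comment, the sketch below reads every unannotated `repeated` field as a list of its own type. `Field` and `describe` are hypothetical helpers, not an API of any Parquet library, and nullability (non-null list, non-null elements) is not rendered in the output.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical, minimal stand-in for a Parquet schema node."""
    name: str
    repetition: str                      # "required", "optional", or "repeated"
    physical_type: Optional[str] = None  # None means the node is a group
    children: List["Field"] = field(default_factory=list)


def describe(node: Field, unwrap_repeated: bool = True) -> str:
    """Render a type string, treating unannotated repeated fields as lists."""
    if node.repetition == "repeated" and unwrap_repeated:
        # The repeated field's own type is the element type.
        return f"List<{describe(node, unwrap_repeated=False)}>"
    if node.physical_type is None:
        inner = ", ".join(f"{c.name}: {describe(c)}" for c in node.children)
        return f"Struct<{inner}>"
    return node.physical_type


# The phoneNumbers group from repeated_no_annotation.parquet, as quoted above.
phone_numbers = Field("phoneNumbers", "optional", children=[
    Field("phone", "repeated", children=[
        Field("number", "required", "INT64"),
        Field("kind", "optional", "BYTE_ARRAY"),
    ]),
])
print(describe(phone_numbers))
# Struct<phone: List<Struct<number: INT64, kind: BYTE_ARRAY>>>
```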

Member Author

AFAICT using repeated without the LIST annotation is still supported by the spec

I don't think the spec supports it, because the spec is not yet clear on this point. The official list type in the spec is the LIST-annotated group with a 3-level structure, which supports arbitrary nesting and can fully specify the nullability of each level. Writers should always use the LIST-annotated 3-level structure; everything else falls under backward compatibility, which exists to deal with existing files. That a writer can accidentally produce such files does not mean it should be that way.

Also, I don't think the current wording requires a primitive type

That's true. Let me change this.

Contributor

#17 is an interesting read, and explains why the unannotated version exists (allows direct conversion from protobuf). I would prefer there be a single, canonical way to represent lists, so perhaps we should bring up deprecating the direct form on the ML. I don't know how relevant it is now (and in fact only learned of its existence last week).

Contributor

@etseidl etseidl left a comment

Thanks for being patient with me Gang 😄. And thanks for taking on this needed clarification.

Co-authored-by: Ed Seidl <[email protected]>

Some existing data does not include the inner element layer, meaning that `LIST`
annotates a 2-level structure. In contrast to 3-level structure, the repetition
of 2-level structure can be `optional`, `required`, or `repeated`.
Contributor

@rdblue rdblue Nov 9, 2024

I find this confusing. Why does it need to change?

The problem isn't when there is a 2-level structure like the one in the new example. The problem is when we need to decide whether a structure is a 2-level or 3-level.

optional group my_list (LIST) {
  repeated group element {
    required binary str (STRING);
    required int32 num;
  };
}

// List<OneTuple<String>> (nullable list, non-null elements)
// Rule 3: List<Struct<String>> (nullable list, non-null elements)
Contributor

I think this makes the example less clear.

4. Otherwise, the repeated field's type is the element type with the repeated
field's repetition.
2. If the repeated field is a group with multiple fields, then its type (Struct
type with multiple fields) is the element type and elements are required.
Contributor

What is a "struct"? Do you mean the group with multiple fields? If so, then this is redundant isn't it?

field's repetition.
2. If the repeated field is a group with multiple fields, then its type (Struct
type with multiple fields) is the element type and elements are required.
3. If the repeated field is a group with one `required` or `optional` field,
Contributor

This rule should not be modified with new cases. I think we need to insert a new rule that states if the repeated field is a group with a repeated field, the repeated field is the element type because the type cannot be a 3-level list.

The rules should be as independent as possible and I think it makes it more confusing to mix them together like this and like the new #4.

Successfully merging this pull request may close these issues.

Backward-compatibility rules on LIST type is unclear
5 participants