GH-465: Clarify backward-compatibility rules on LIST type #466
base: master
abc3d1d to d9579bc
BTW, can you also point out the Java code in this PR?
Otherwise LGTM, but I think it is worth raising a discussion...
4551c79 to f200d34
I have sent a discussion thread to the dev ML. It would be good if you can take a look. Thanks! @emkornfield @pitrou @gszadovszky @rdblue @etseidl @clairemcginty
f200d34 to fc8aca3
Seems like a nice clarification to me. One little (ignorable) nit.
I'm also wondering if we should add an example above for the non-LIST annotated repeating fields, or should that be a new PR?
LogicalTypes.md
Outdated
repeated type is the element type and elements are required.
element type and elements are required. In this case, the element type is
a Struct type with multiple fields.
3. If the repeated field is a group (without annotation) with one `required` or
I might leave the entire block as it was previously and clarify that:
- Without annotation the type is a single struct/tuple with the element in it.
- With a logical type annotation (I think only LIST and MAP apply here).
I also think we need to update the documentation above:
The outer-most level must be a group annotated with `LIST` that contains a
single field named `list`. The repetition of this level must be either
`optional` or `required` and determines whether the list is nullable.
To point to the fact that for backwards compatibility it can also be repeated.
> To point to the fact that for backwards compatibility it can also be repeated.
The original statement is all about the formal three-level encoding. Adding `repeated` to it may further confuse readers. That's why I tried to clarify this in the backward compatibility section.
This comment was marked as resolved.
LogicalTypes.md
Outdated
Some existing data does not include the inner element layer. For
backward-compatibility, the type of elements in `LIST`-annotated structures
should always be determined by the following rules:
Some existing data does not include the inner element layer, meaning that
I have added more explanation according to the ML discussion. I'm not a native speaker, please help me check if these are suitable. Thanks! @rdblue @gszadovszky
Thanks for clarifying! This is so tricky especially rule 3 and rule 4 in 2-level lists...
``` | ||

For backward-compatibility, the type of elements in `LIST`-annotated 2-level
structures should always be determined by the following rules:

1. If the repeated field is not a group, then its type is the element type and
elements are required.
An unrelated note: I find this "elements are required" wording a bit tricky...
field's repetition.
2. If the repeated field is a group with multiple fields, then its type (Struct
type with multiple fields) is the element type and elements are required.
3. If the repeated field is a group with one `required` or `optional` field,
with only one?
optional group my_list (LIST) {
  repeated group my_list_tuple {
    required binary str (STRING);
  };
}

Do we have a sample for Rule 5?
As I have explained in https://lists.apache.org/thread/s6b25j3x26009v054yqjov0f1z49ctqj, the only case left for rule 5 is as below:
optional group foo (LIST) {
  repeated group bar {
    repeated TYPE baz;
  };
}
It should either be resolved to `List<Struct<List<TYPE>>>`, if the 1-level structure is allowed, or be treated as an invalid case.
and is named either `array` or uses the `LIST`-annotated group's name with
`_tuple` appended, then the repeated type (Struct type with single field) is
the element type and elements are required.
4. If the repeated field is a `LIST`-annotated group with one `repeated` field,
At one point I thought you had rule 3 specify that the repeated group is unannotated. Without that, should rules 3 and 4 be swapped?
No, rule 3 is for a group with one `required` or `optional` field, while rule 4 is for a `LIST`-annotated group with one `repeated` field. They do not overlap, and rule 3 implies that the group cannot have any annotation, so their order does not matter.
Got it, thanks.
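For readers following this thread, here is a minimal Python sketch of how a reader might apply rules 1 to 3 as quoted in the diff context above. The `Field` class and `resolve_element` function are hypothetical stand-ins, not part of parquet-java, parquet-cpp, or any real library. Rule 4 is truncated in the quoted diff and rule 5 is still under discussion, so the sketch refuses the case of a single `repeated` child rather than guessing. The final fallback assumes the standard 3-level layout, where the single child of the repeated group is the element; that reading is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Field:
    """Hypothetical stand-in for a Parquet schema node (not a real library type)."""
    name: str
    repetition: str                      # "required", "optional", or "repeated"
    kind: str = "group"                  # primitive type name, or "group"
    children: List["Field"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        return self.kind == "group"


def resolve_element(list_group: Field) -> Tuple[Field, str]:
    """Return (element_field, element_repetition) for a LIST-annotated group."""
    if len(list_group.children) != 1 or list_group.children[0].repetition != "repeated":
        raise ValueError("a LIST-annotated group needs exactly one repeated child")
    repeated = list_group.children[0]

    # Rule 1: the repeated field is not a group.
    if not repeated.is_group:
        return repeated, "required"
    if not repeated.children:
        raise ValueError("a group must have at least one field")
    # Rule 2: the repeated field is a group with multiple fields.
    if len(repeated.children) > 1:
        return repeated, "required"

    only = repeated.children[0]
    # Rule 3: one required/optional field, and the group is named `array`
    # or `<list name>_tuple`.
    legacy_names = ("array", list_group.name + "_tuple")
    if only.repetition in ("required", "optional") and repeated.name in legacy_names:
        return repeated, "required"
    # Rules 4 and 5 are truncated or unsettled in the thread: not modelled here.
    if only.repetition == "repeated":
        raise NotImplementedError("single repeated child: see rules 4 and 5")
    # Assumed fallback: standard 3-level layout, the single child is the element.
    return only, only.repetition


# The Rule 3 example from the diff: the one-field struct itself is the element.
my_list = Field("my_list", "optional", children=[
    Field("my_list_tuple", "repeated", children=[Field("str", "required", "binary")]),
])
element, repetition = resolve_element(my_list)
print(element.name, repetition)
```

Keeping each rule as its own early return mirrors the review request that the rules stay as independent as possible.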
LogicalTypes.md
Outdated
##### 1-level structure without `LIST` annotation

Some existing data does not even have the `LIST` annotation and simply uses
`repeated` repetition to annotate the element type. In this case, the element
type MUST be a primitive type and both the list and elements are required.
I don't know if this belongs in the backwards compatibility section. AFAICT using `repeated` without the `LIST` annotation is still supported by the spec (and some writers can still produce it, including arrow-rs; cc @zeevm).
Also, I don't think the current wording requires a primitive type. The following should be allowed (`List<Struct<Integer, String>>`, non-null list, non-null elements):
repeated group list_struct {
  required int32 id;
  optional binary val (STRING);
}
In fact, in parquet-testing there's a file `repeated_no_annotation.parquet` with an unannotated `repeated` group:
message user {
  REQUIRED INT32 id;
  OPTIONAL group phoneNumbers {
    REPEATED group phone {
      REQUIRED INT64 number;
      OPTIONAL BYTE_ARRAY kind (UTF8);
    }
  }
}
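To make the reading in this comment concrete, here is a small Python sketch that renders an unannotated schema node as a logical type, treating any `repeated` field as a list whose element type is the field's own type. The `Field` class and `render` function are hypothetical helpers for illustration only. Nullability is not rendered; under this reading both the list and its elements are non-null.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical stand-in for a Parquet schema node (not a real library type)."""
    name: str
    repetition: str                      # "required", "optional", or "repeated"
    kind: str = "group"                  # primitive type name, or "group"
    children: List["Field"] = field(default_factory=list)


def render(f: Field) -> str:
    """Render a field with no LIST/MAP annotation as a logical type string."""
    if f.children:
        members = ", ".join(f"{c.name}: {render(c)}" for c in f.children)
        inner = f"Struct<{members}>"
    else:
        inner = f.kind
    # An unannotated repeated field is read as a list of its own type.
    if f.repetition == "repeated":
        return f"List<{inner}>"
    return inner


# The `phoneNumbers` group from repeated_no_annotation.parquet, as quoted above.
phone = Field("phone", "repeated", children=[
    Field("number", "required", "INT64"),
    Field("kind", "optional", "BYTE_ARRAY"),
])
phone_numbers = Field("phoneNumbers", "optional", children=[phone])
print(render(phone_numbers))
```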
> AFAICT using repeated without the LIST annotation is still supported by the spec
I don't think it is supported by the spec, because the spec is not yet clear on this point. The official list type in the spec is the `LIST`-annotated group with a 3-level structure, which supports arbitrary nesting and full control over the nullability of each level. Writers should always use the `LIST`-annotated 3-level structure; everything else falls into the category of backward compatibility for dealing with existing files. That a writer can accidentally produce such files does not mean it should be that way.
> Also, I don't think the current wording requires a primitive type
That's true. Let me change this.
#17 is an interesting read, and explains why the unannotated version exists (allows direct conversion from protobuf). I would prefer there be a single, canonical way to represent lists, so perhaps we should bring up deprecating the direct form on the ML. I don't know how relevant it is now (and in fact only learned of its existence last week).
Thanks for being patient with me Gang 😄. And thanks for taking on this needed clarification.
Co-authored-by: Ed Seidl <[email protected]>

Some existing data does not include the inner element layer, meaning that `LIST`
annotates a 2-level structure. In contrast to 3-level structure, the repetition
of 2-level structure can be `optional`, `required`, or `repeated`.
I find this confusing. Why does it need to change?
The problem isn't when there is a 2-level structure like the one in the new example. The problem is when we need to decide whether a structure is a 2-level or 3-level.
optional group my_list (LIST) {
  repeated group element {
    required binary str (STRING);
    required int32 num;
  };
}

// List<OneTuple<String>> (nullable list, non-null elements)
// Rule 3: List<Struct<String>> (nullable list, non-null elements)
I think this makes the example less clear.
4. Otherwise, the repeated field's type is the element type with the repeated
field's repetition.
2. If the repeated field is a group with multiple fields, then its type (Struct
type with multiple fields) is the element type and elements are required.
What is a "struct"? Do you mean the group with multiple fields? If so, then this is redundant isn't it?
field's repetition.
2. If the repeated field is a group with multiple fields, then its type (Struct
type with multiple fields) is the element type and elements are required.
3. If the repeated field is a group with one `required` or `optional` field,
This rule should not be modified with new cases. I think we need to insert a new rule that states if the repeated field is a group with a `repeated` field, the repeated field is the element type because the type cannot be a 3-level list.
The rules should be as independent as possible and I think it makes it more confusing to mix them together like this and like the new #4.
Rationale for this change
The C++ reader in parquet-cpp has a hard time reading Parquet files written by parquet-java with `parquet.avro.write-old-list-structure=true` and the schema below:
See apache/arrow#43994 and apache/arrow#43995
What changes are included in this PR?
Clarify the rules and add an example.
Do these changes have PoC implementations?
Not required
Closes #465