-
Notifications
You must be signed in to change notification settings - Fork 543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance the nested type access for Generic and DuckDB dialect #1541
Conversation
We should refactor to integrate For example, DuckDB allows the following SQL, which is currently invalid in our parser: select [MAP {2:{a:1, b:[2,3,4]}}, MAP {2:{a:1, b:[2,3,4]}}][1][2][1].b[1:2] as result;
┌─────────┐
│ result │
│ int32[] │
├─────────┤
│ [2, 3] │
└─────────┘ The access pattern datafusion-sqlparser-rs/src/ast/mod.rs Lines 805 to 808 in fad2ddd
|
let expr = self.parse_multi_dim_subscript(expr)?; | ||
if self.dialect.support_period_map_access_key() { | ||
self.parse_map_access(expr, vec![]) | ||
} else { | ||
Ok(expr) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah to your comment I'm thinking it makes sense to already in this PR merge the subscript behavior/representation into mapaccess? thinking that looks like it'll resolve both issues and adding a new dialect flag and extending the two codepaths compounds the issue it seems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank @iffyio. Indeed, if we merge them in this PR, we can fix many things. It could be a big refactor 🤔
I have two candidate proposals for it:
Merge Subscript
into MapAcess
and rename MapAccess
- Remove the
Expr::Subscript
and add a newMapAccessSyntax::Slice
for[1:5]
SQL - Rename
MapAcess
toElementAccess
for the elements access ofMap
andArray
.
Remove MapAccess
and integrate with CompositeAccess
CompositeAccess
is a syntax structure for expr1.expr2
. I think we can use it to represent the period map access. We can use CompositeAccess
and Subscript
to present the access chain like expr1.expr2[1].expr3
CompositeAccess (
CompositeAccess (expr1,
CompositeAccess( expr2 ,
Subscript(1))), Indent(expr3))
Then, we don't need MapAccess
for the chain.
What do you think? Which one do you prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first option to merge subscript into mapaccess sounds reasonable! I'm thinking we could skip the rename at least to start with to keep the breakage minimal and I'm imagining it shouldn't be as large of a change in that case, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first option to merge subscript into mapaccess sounds reasonable! I'm thinking we could skip the rename at least to start with to keep the breakage minimal and I'm imagining it shouldn't be as large of a change in that case, wdyt?
I think removing Subscript
causes more breakage than removing MapAccess
. 🤔
Do you know how many downstream projects use MapAccess
? I found that DataFusion
hasn’t implemented it, but Subscript
is used to handle array syntax. I'm not sure which one is better.
I drafted a version for option 2: #1551.
It still has some issues with Expr::Method
parsing, but I think it preserves Subscript
and avoids a significant breaking change for downstream projects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, #1551 can also support the syntax like
SELECT named_struct('a', 1, 'b', 2).a,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I think if we keep subscript is we should likely change its format, iirc it uses a nested representation which I don't think is necessary given we only need to encode a linear field access chain? i.e. that it uses a list similar to mapaccess
I prefer to keep the nested representation for the following reasons:
- It preserves the original syntax for arrays (
a[1]
) and maps (a['field']
). - By combining
Subscript
andCompositeAccess
(and possiblyMethod
, if needed 🤔), we can cover the entire syntax of access chains without requiring users to introduce additionalExpr
. This makes the SQL syntax more stable. Some examples include:
a.b.c # may duplicate with CompoundIdentifier
a[1].b.c
a.b[1:5].c
a.b['f1'].c
a[1][2].b[3].c
array_func()[1].a.b
struct_func().a[1].b
# if Expr::Method is enabled
struct_func().method1().method2()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Not sure I fully followed what we mean by preserving syntax, do we gain from it from a AST-consuming pov you mean?
- Do you mean that those won't be possible if the representation is linear?
I was primarily thinking it would be easier and more efficient to traverse the ast if the above variants are expressed as a chain without nesting. Currently, both the parser and downstream crates tend to struggle with the recursive nature when there's a lot of nesting going on in large/complex sql.
Offhand, not sure I have a full picture of either approach though, its not super clear to me what the disadvantage would be with linear, or if there are advantages to using a nested representation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Not sure I fully followed what we mean by preserving syntax, do we gain from it from a AST-consuming pov you mean?
It's just my assumption. If I'm wrong, please correct me.
I mean if we accept the nested presentation, we don't need to change Subscript
and won't make breaking changes for the array or map syntax. I'm not sure about other downstream projects. At least, DataFusion won't be broken. If it would cause big breaking change, I think reconsidering about it 🤔
- Do you mean that those won't be possible if the representation is linear?
When working on this issue, I found we implemented many different Expr
for similar purposes (access chain). For example
MapAccess
fora[1][2][3]
ora[1].b.c[3]
Subscript
fora[1]
ora[1][2]
, ...JsonAccess
fora.b[0].c
, ...CompoundIdentifier
fora.b.c
CompositeAccess
for( .. ).a
Method
forfunc1().func2().func3()
They are the same thing, with different combinations and orderings, and maybe for various dialects. I think it also means we have various code paths for the same purposes.
I hope to use some basic components to present all of them.
I think it's possible to use a linear representation to do it but it could make a huge breaking change.
I was primarily thinking it would be easier and more efficient to traverse the ast if the above variants are expressed as a chain without nesting. Currently, both the parser and downstream crates tend to struggle with the recursive nature when there's a lot of nesting going on in large/complex sql.
Offhand, not sure I have a full picture of either approach though, its not super clear to me what the disadvantage would be with linear, or if there are advantages to using a nested representation
It makes sense to me. Indeed, the complex nested representation is a potential issue for performance or usage 🤔
I tried to draft a new linear representation like:
pub enum Expr {
...
// The haed will be concant with all the access fields using the dot.
//
// a[1] is present to CompoundExpr { head: Identifer(a), chain: [Subscript(Script(1))] }
// a.b[1].c is present to CompoundExpr { head: Identifer(a), chain: [Expr(Ident(b)), Subscript(SubScirpt(1)), Expr(Ident(b))]
// func1().a is present to CompoundExpr { head: FuntionCall(func1, ..), chain: [Expr(Ident(b))] }
CompoundExpr {
head: Box<Expr>,
chain: Vec<AccessField>,
},
...
}
#[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
#[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
pub enum AccessField {
Expr(Expr),
SubScript(Subscript)
}
I think it can cover many cases of the access chain. I'm not sure about naming but I don't prefer to keep using Expr::SubScript
or Expr::MapAcces
because it has turned to a different meaning. I prefer to remove both of them.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice yeah I think the CompoundExpr
example to represent the different variants would make a lot of sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice yeah I think the
CompoundExpr
example to represent the different variants would make a lot of sense!
Thanks. I'll follow this design in #1551. Let's close this PR 👍
Move to another design #1551 and see #1541 (comment) for the detail. |
Description
CompositeAccess
to enhance the struct type #1533support_period_map_access_key
configurable.TODO items
SELECT named_struct('a', 1, 'b', 2).a