Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance the nested type access for Generic and DuckDB dialect #1541

Closed

Conversation

goldmedal
Copy link
Contributor

@goldmedal goldmedal commented Nov 21, 2024

Description

select c1[1].f1.f2[2] from t1
  • make support_period_map_access_key configurable.

TODO items

  • allow the SQL access the map element from a function call: SELECT named_struct('a', 1, 'b', 2).a

@goldmedal
Copy link
Contributor Author

We should refactor to integrate Subscript and MapAccessSyntax::Bracket, as they are similar and difficult to distinguish at the SQL layer.

For example, DuckDB allows the following SQL, which is currently invalid in our parser:

select [MAP {2:{a:1, b:[2,3,4]}}, MAP {2:{a:1, b:[2,3,4]}}][1][2][1].b[1:2] as result;
┌─────────┐
│ result  │
│ int32[] │
├─────────┤
│ [2, 3]  │
└─────────┘

The access pattern [1][2][1].b[1:2] includes both Subscript and MapAccess. We cannot parse all these as a MapAccessKey following the . token.

MapAccess {
column: Box<Expr>,
keys: Vec<MapAccessKey>,
},

@goldmedal goldmedal marked this pull request as ready for review November 21, 2024 09:13
Comment on lines +2938 to +2943
let expr = self.parse_multi_dim_subscript(expr)?;
if self.dialect.support_period_map_access_key() {
self.parse_map_access(expr, vec![])
} else {
Ok(expr)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah to your comment I'm thinking it makes sense to already in this PR merge the subscript behavior/representation into mapaccess? thinking that looks like it'll resolve both issues and adding a new dialect flag and extending the two codepaths compounds the issue it seems.

Copy link
Contributor Author

@goldmedal goldmedal Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank @iffyio. Indeed, if we merge them in this PR, we can fix many things. It could be a big refactor 🤔
I have two candidate proposals for it:

Merge Subscript into MapAcess and rename MapAccess

  • Remove the Expr::Subscript and add a new MapAccessSyntax::Slice for [1:5] SQL
  • Rename MapAcess to ElementAccess for the elements access of Map and Array.

Remove MapAccess and integrate with CompositeAccess

CompositeAccess is a syntax structure for expr1.expr2. I think we can use it to represent the period map access. We can use CompositeAccess and Subscript to present the access chain like expr1.expr2[1].expr3

CompositeAccess (
    CompositeAccess (expr1,
        CompositeAccess( expr2 ,
            Subscript(1))), Indent(expr3))

Then, we don't need MapAccess for the chain.

What do you think? Which one do you prefer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first option to merge subscript into mapaccess sounds reasonable! I'm thinking we could skip the rename at least to start with to keep the breakage minimal and I'm imagining it shouldn't be as large of a change in that case, wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first option to merge subscript into mapaccess sounds reasonable! I'm thinking we could skip the rename at least to start with to keep the breakage minimal and I'm imagining it shouldn't be as large of a change in that case, wdyt?

I think removing Subscript causes more breakage than removing MapAccess. 🤔
Do you know how many downstream projects use MapAccess? I found that DataFusion hasn’t implemented it, but Subscript is used to handle array syntax. I'm not sure which one is better.

I drafted a version for option 2: #1551.
It still has some issues with Expr::Method parsing, but I think it preserves Subscript and avoids a significant breaking change for downstream projects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, #1551 can also support the syntax like

SELECT named_struct('a', 1, 'b', 2).a,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I think if we keep subscript is we should likely change its format, iirc it uses a nested representation which I don't think is necessary given we only need to encode a linear field access chain? i.e. that it uses a list similar to mapaccess

I prefer to keep the nested representation for the following reasons:

  • It preserves the original syntax for arrays (a[1]) and maps (a['field']).
  • By combining Subscript and CompositeAccess (and possibly Method, if needed 🤔), we can cover the entire syntax of access chains without requiring users to introduce additional Expr. This makes the SQL syntax more stable. Some examples include:
a.b.c  # may duplicate with CompoundIdentifier
a[1].b.c
a.b[1:5].c
a.b['f1'].c
a[1][2].b[3].c
array_func()[1].a.b
struct_func().a[1].b
# if Expr::Method is enabled
struct_func().method1().method2()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Not sure I fully followed what we mean by preserving syntax, do we gain from it from a AST-consuming pov you mean?
  2. Do you mean that those won't be possible if the representation is linear?

I was primarily thinking it would be easier and more efficient to traverse the ast if the above variants are expressed as a chain without nesting. Currently, both the parser and downstream crates tend to struggle with the recursive nature when there's a lot of nesting going on in large/complex sql.

Offhand, not sure I have a full picture of either approach though, its not super clear to me what the disadvantage would be with linear, or if there are advantages to using a nested representation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Not sure I fully followed what we mean by preserving syntax, do we gain from it from a AST-consuming pov you mean?

It's just my assumption. If I'm wrong, please correct me.
I mean if we accept the nested presentation, we don't need to change Subscript and won't make breaking changes for the array or map syntax. I'm not sure about other downstream projects. At least, DataFusion won't be broken. If it would cause big breaking change, I think reconsidering about it 🤔

  1. Do you mean that those won't be possible if the representation is linear?

When working on this issue, I found we implemented many different Expr for similar purposes (access chain). For example

  • MapAccess for a[1][2][3] or a[1].b.c[3]
  • Subscript for a[1] or a[1][2], ...
  • JsonAccess for a.b[0].c, ...
  • CompoundIdentifier for a.b.c
  • CompositeAccess for ( .. ).a
  • Method for func1().func2().func3()

They are the same thing, with different combinations and orderings, and maybe for various dialects. I think it also means we have various code paths for the same purposes.
I hope to use some basic components to present all of them.

I think it's possible to use a linear representation to do it but it could make a huge breaking change.

I was primarily thinking it would be easier and more efficient to traverse the ast if the above variants are expressed as a chain without nesting. Currently, both the parser and downstream crates tend to struggle with the recursive nature when there's a lot of nesting going on in large/complex sql.

Offhand, not sure I have a full picture of either approach though, its not super clear to me what the disadvantage would be with linear, or if there are advantages to using a nested representation

It makes sense to me. Indeed, the complex nested representation is a potential issue for performance or usage 🤔
I tried to draft a new linear representation like:

pub enum Expr {
    ...
    // The haed will be concant with all the access fields using the dot.
    //
    // a[1] is present to CompoundExpr { head: Identifer(a), chain: [Subscript(Script(1))] }
    // a.b[1].c is present to CompoundExpr { head: Identifer(a), chain: [Expr(Ident(b)), Subscript(SubScirpt(1)), Expr(Ident(b))]
    // func1().a is present to CompoundExpr { head: FuntionCall(func1, ..), chain: [Expr(Ident(b))] }
    CompoundExpr {
        head: Box<Expr>,
        chain: Vec<AccessField>,
    },
   ...
}

#[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
#[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
pub enum AccessField {
    Expr(Expr),
    SubScript(Subscript)
}

I think it can cover many cases of the access chain. I'm not sure about naming but I don't prefer to keep using Expr::SubScript or Expr::MapAcces because it has turned to a different meaning. I prefer to remove both of them.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice yeah I think the CompoundExpr example to represent the different variants would make a lot of sense!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice yeah I think the CompoundExpr example to represent the different variants would make a lot of sense!

Thanks. I'll follow this design in #1551. Let's close this PR 👍

@goldmedal
Copy link
Contributor Author

Move to another design #1551 and see #1541 (comment) for the detail.

@goldmedal goldmedal closed this Nov 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve CompositeAccess to enhance the struct type
2 participants