Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance the nested type access for Generic and DuckDB dialect #1541

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions src/dialect/bigquery.rs
Original file line number Diff line number Diff line change
Expand Up @@ -72,4 +72,8 @@ impl Dialect for BigQueryDialect {
fn require_interval_qualifier(&self) -> bool {
true
}

fn support_period_map_access_key(&self) -> bool {
true
}
}
5 changes: 5 additions & 0 deletions src/dialect/duckdb.rs
Original file line number Diff line number Diff line change
Expand Up @@ -75,4 +75,9 @@ impl Dialect for DuckDbDialect {
fn supports_load_extension(&self) -> bool {
true
}

/// See DuckDB <https://duckdb.org/docs/sql/data_types/struct.html#retrieving-from-structs>
fn support_period_map_access_key(&self) -> bool {
true
}
}
4 changes: 4 additions & 0 deletions src/dialect/generic.rs
Original file line number Diff line number Diff line change
Expand Up @@ -123,4 +123,8 @@ impl Dialect for GenericDialect {
fn supports_named_fn_args_with_assignment_operator(&self) -> bool {
true
}

fn support_period_map_access_key(&self) -> bool {
true
}
}
9 changes: 9 additions & 0 deletions src/dialect/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -675,6 +675,15 @@ pub trait Dialect: Debug + Any {
fn supports_create_table_select(&self) -> bool {
false
}

/// Return true if the dialect supports the period map access key
///
/// Access on BigQuery nested and repeated expressions can
/// mix notations in the same expression.
/// <https://cloud.google.com/bigquery/docs/nested-repeated#query_nested_and_repeated_columns>
fn support_period_map_access_key(&self) -> bool {
false
}
}

/// This represents the operators for which precedence must be defined
Expand Down
64 changes: 46 additions & 18 deletions src/parser/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2935,12 +2935,23 @@ impl<'a> Parser<'a> {
})
} else if Token::LBracket == tok {
if dialect_of!(self is PostgreSqlDialect | DuckDbDialect | GenericDialect) {
self.parse_subscript(expr)
let expr = self.parse_multi_dim_subscript(expr)?;
if self.dialect.support_period_map_access_key() {
self.parse_map_access(expr, vec![])
} else {
Ok(expr)
}
Comment on lines +2938 to +2943
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah to your comment I'm thinking it makes sense to already in this PR merge the subscript behavior/representation into mapaccess? thinking that looks like it'll resolve both issues and adding a new dialect flag and extending the two codepaths compounds the issue it seems.

Copy link
Contributor Author

@goldmedal goldmedal Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank @iffyio. Indeed, if we merge them in this PR, we can fix many things. It could be a big refactor 🤔
I have two candidate proposals for it:

Merge Subscript into MapAcess and rename MapAccess

  • Remove the Expr::Subscript and add a new MapAccessSyntax::Slice for [1:5] SQL
  • Rename MapAcess to ElementAccess for the elements access of Map and Array.

Remove MapAccess and integrate with CompositeAccess

CompositeAccess is a syntax structure for expr1.expr2. I think we can use it to represent the period map access. We can use CompositeAccess and Subscript to present the access chain like expr1.expr2[1].expr3

CompositeAccess (
    CompositeAccess (expr1,
        CompositeAccess( expr2 ,
            Subscript(1))), Indent(expr3))

Then, we don't need MapAccess for the chain.

What do you think? Which one do you prefer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first option to merge subscript into mapaccess sounds reasonable! I'm thinking we could skip the rename at least to start with to keep the breakage minimal and I'm imagining it shouldn't be as large of a change in that case, wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first option to merge subscript into mapaccess sounds reasonable! I'm thinking we could skip the rename at least to start with to keep the breakage minimal and I'm imagining it shouldn't be as large of a change in that case, wdyt?

I think removing Subscript causes more breakage than removing MapAccess. 🤔
Do you know how many downstream projects use MapAccess? I found that DataFusion hasn’t implemented it, but Subscript is used to handle array syntax. I'm not sure which one is better.

I drafted a version for option 2: #1551.
It still has some issues with Expr::Method parsing, but I think it preserves Subscript and avoids a significant breaking change for downstream projects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, #1551 can also support the syntax like

SELECT named_struct('a', 1, 'b', 2).a,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I think if we keep subscript is we should likely change its format, iirc it uses a nested representation which I don't think is necessary given we only need to encode a linear field access chain? i.e. that it uses a list similar to mapaccess

I prefer to keep the nested representation for the following reasons:

  • It preserves the original syntax for arrays (a[1]) and maps (a['field']).
  • By combining Subscript and CompositeAccess (and possibly Method, if needed 🤔), we can cover the entire syntax of access chains without requiring users to introduce additional Expr. This makes the SQL syntax more stable. Some examples include:
a.b.c  # may duplicate with CompoundIdentifier
a[1].b.c
a.b[1:5].c
a.b['f1'].c
a[1][2].b[3].c
array_func()[1].a.b
struct_func().a[1].b
# if Expr::Method is enabled
struct_func().method1().method2()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Not sure I fully followed what we mean by preserving syntax, do we gain from it from a AST-consuming pov you mean?
  2. Do you mean that those won't be possible if the representation is linear?

I was primarily thinking it would be easier and more efficient to traverse the ast if the above variants are expressed as a chain without nesting. Currently, both the parser and downstream crates tend to struggle with the recursive nature when there's a lot of nesting going on in large/complex sql.

Offhand, not sure I have a full picture of either approach though, its not super clear to me what the disadvantage would be with linear, or if there are advantages to using a nested representation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Not sure I fully followed what we mean by preserving syntax, do we gain from it from a AST-consuming pov you mean?

It's just my assumption. If I'm wrong, please correct me.
I mean if we accept the nested presentation, we don't need to change Subscript and won't make breaking changes for the array or map syntax. I'm not sure about other downstream projects. At least, DataFusion won't be broken. If it would cause big breaking change, I think reconsidering about it 🤔

  1. Do you mean that those won't be possible if the representation is linear?

When working on this issue, I found we implemented many different Expr for similar purposes (access chain). For example

  • MapAccess for a[1][2][3] or a[1].b.c[3]
  • Subscript for a[1] or a[1][2], ...
  • JsonAccess for a.b[0].c, ...
  • CompoundIdentifier for a.b.c
  • CompositeAccess for ( .. ).a
  • Method for func1().func2().func3()

They are the same thing, with different combinations and orderings, and maybe for various dialects. I think it also means we have various code paths for the same purposes.
I hope to use some basic components to present all of them.

I think it's possible to use a linear representation to do it but it could make a huge breaking change.

I was primarily thinking it would be easier and more efficient to traverse the ast if the above variants are expressed as a chain without nesting. Currently, both the parser and downstream crates tend to struggle with the recursive nature when there's a lot of nesting going on in large/complex sql.

Offhand, not sure I have a full picture of either approach though, its not super clear to me what the disadvantage would be with linear, or if there are advantages to using a nested representation

It makes sense to me. Indeed, the complex nested representation is a potential issue for performance or usage 🤔
I tried to draft a new linear representation like:

pub enum Expr {
    ...
    // The haed will be concant with all the access fields using the dot.
    //
    // a[1] is present to CompoundExpr { head: Identifer(a), chain: [Subscript(Script(1))] }
    // a.b[1].c is present to CompoundExpr { head: Identifer(a), chain: [Expr(Ident(b)), Subscript(SubScirpt(1)), Expr(Ident(b))]
    // func1().a is present to CompoundExpr { head: FuntionCall(func1, ..), chain: [Expr(Ident(b))] }
    CompoundExpr {
        head: Box<Expr>,
        chain: Vec<AccessField>,
    },
   ...
}

#[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
#[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
pub enum AccessField {
    Expr(Expr),
    SubScript(Subscript)
}

I think it can cover many cases of the access chain. I'm not sure about naming but I don't prefer to keep using Expr::SubScript or Expr::MapAcces because it has turned to a different meaning. I prefer to remove both of them.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice yeah I think the CompoundExpr example to represent the different variants would make a lot of sense!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice yeah I think the CompoundExpr example to represent the different variants would make a lot of sense!

Thanks. I'll follow this design in #1551. Let's close this PR 👍

} else if dialect_of!(self is SnowflakeDialect) {
self.prev_token();
self.parse_json_access(expr)
} else {
self.parse_map_access(expr)
let key = self.parse_expr()?;
self.expect_token(&Token::RBracket)?;
let keys = vec![MapAccessKey {
key,
syntax: MapAccessSyntax::Bracket,
}];
self.parse_map_access(expr, keys)
}
} else if dialect_of!(self is SnowflakeDialect | GenericDialect) && Token::Colon == tok {
self.prev_token();
Expand Down Expand Up @@ -3036,6 +3047,19 @@ impl<'a> Parser<'a> {
})
}

/// Parse an multi-dimension array accessing like `[1:3][1][1]`
///
/// Parser is right after the first `[`
pub fn parse_multi_dim_subscript(&mut self, mut expr: Expr) -> Result<Expr, ParserError> {
loop {
expr = self.parse_subscript(expr)?;
if !self.consume_token(&Token::LBracket) {
break;
}
}
Ok(expr)
}

/// Parses an array subscript like `[1:3]`
///
/// Parser is right after `[`
Expand Down Expand Up @@ -3101,14 +3125,15 @@ impl<'a> Parser<'a> {
})
}

pub fn parse_map_access(&mut self, expr: Expr) -> Result<Expr, ParserError> {
let key = self.parse_expr()?;
self.expect_token(&Token::RBracket)?;

let mut keys = vec![MapAccessKey {
key,
syntax: MapAccessSyntax::Bracket,
}];
/// Parse the map access like `[key]` or `.key` if [Dialect::support_period_map_access_key] is true
/// It could be an access-chain like `[key1][key2].key3`
///
/// The parameter `keys` is an initialized buffer that could contain some keys parsed from other places.
pub fn parse_map_access(
&mut self,
expr: Expr,
mut keys: Vec<MapAccessKey>,
) -> Result<Expr, ParserError> {
loop {
let key = match self.peek_token().token {
Token::LBracket => {
Expand All @@ -3120,10 +3145,7 @@ impl<'a> Parser<'a> {
syntax: MapAccessSyntax::Bracket,
}
}
// Access on BigQuery nested and repeated expressions can
// mix notations in the same expression.
// https://cloud.google.com/bigquery/docs/nested-repeated#query_nested_and_repeated_columns
Token::Period if dialect_of!(self is BigQueryDialect) => {
Token::Period if self.dialect.support_period_map_access_key() => {
self.next_token(); // consume `.`
MapAccessKey {
key: self.parse_expr()?,
Expand All @@ -3135,10 +3157,16 @@ impl<'a> Parser<'a> {
keys.push(key);
}

Ok(Expr::MapAccess {
column: Box::new(expr),
keys,
})
// If no any key be collected, it means the elements have been parsed to [Subscript]
// e.g. `select abc[1]` or `select abc[1][2]`
if keys.is_empty() {
Ok(expr)
} else {
Ok(Expr::MapAccess {
column: Box::new(expr),
keys,
})
}
}

/// Parses the parens following the `[ NOT ] IN` operator.
Expand Down
22 changes: 22 additions & 0 deletions tests/sqlparser_common.rs
Original file line number Diff line number Diff line change
Expand Up @@ -12161,3 +12161,25 @@ fn parse_create_table_select() {
);
}
}

#[test]
fn test_period_map_access() {
let supported_dialects = TestedDialects::new(vec![
Box::new(GenericDialect {}),
Box::new(DuckDbDialect {}),
]);
let sqls = [
"SELECT abc[1] FROM t",
"SELECT abc[1].f1 FROM t",
"SELECT abc[1].f1.f2 FROM t",
"SELECT f1.abc[1] FROM t",
"SELECT f1.f2.abc[1] FROM t",
"SELECT f1.abc[1].f2 FROM t",
"SELECT abc['a'][1].f1 FROM t",
"SELECT abc['a'].f1[1].f2 FROM t",
"SELECT abc['a'].f1[1].f2[2] FROM t",
];
for sql in sqls {
supported_dialects.verified_stmt(sql);
}
}
Loading