[SPARK-48356][SQL] Support for FOR statement #48794
Conversation
case m: Map[_, _] =>
  // arguments of CreateMap are in the format: (key1, val1, key2, val2, ...)
  val mapArgs = m.keys.toSeq.flatMap { key =>
    Seq(createExpressionFromValue(key), createExpressionFromValue(m(key)))
  }
  CreateMap(mapArgs, false)
case s: GenericRowWithSchema =>
  // struct types match this case
  // arguments of CreateNamedStruct are in the format: (name1, val1, name2, val2, ...)
  val namedStructArgs = s.schema.names.toSeq.flatMap { colName =>
    val valueExpression = createExpressionFromValue(s.getAs(colName))
    Seq(Literal(colName), valueExpression)
  }
  CreateNamedStruct(namedStructArgs)
case _ => Literal(value)
For my knowledge, can you explain what the case with the Map means exactly, i.e. when will this happen?
Also, how did we check that this is the complete list of the relevant cases?
When a Map or Struct is in the result set of the query, we can't use Literal(value) to convert it to an expression because Literals don't support them. So, for example, for a Map we recursively convert both keys and values to expressions first, and then create a map expression using CreateMap. The process is similar for structs.
The way I checked is that I went through all the Spark data types and, for each one, checked in the code of Literal whether it's supported. I only found these two which are not. However, I agree we can't be completely sure, and new types will be added to Spark in the future which Literals may or may not support. Probably I should add an error message for a currently unsupported type, in case it comes up. Does that make sense to you?
Yeah, I would say an internal error is fine in this case (i.e. no need to introduce a new error for this), since it would mean that we have a bug.
Other than that, this sounds fine to me, but let's wait for Max and/or Wenchen to comment on this if they have any concerns.
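As a side note, a minimal sketch of the fallback discussed here (hypothetical, not the code in this PR; it assumes SparkException.internalError is the way to flag a bug):

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}

def createExpressionFromValue(value: Any): Expression = value match {
  // ... Map and GenericRowWithSchema cases as in the diff above ...
  case _ =>
    try {
      Literal(value) // covers the remaining supported types
    } catch {
      // a type Literal can't represent means we missed a case, i.e. a bug
      case _: RuntimeException =>
        throw SparkException.internalError(
          s"Unsupported value in FOR statement result set: ${value.getClass.getName}")
    }
}
```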
override def next(): CompoundStatementExec = state match {

  case ForState.VariableAssignment =>
    variablesMap = createVariablesMapFromRow(cachedQueryResult()(currRow))
Why do we need to create this every time? Can we fill variablesMap once and then reuse it?
We need to create it every time because the map is different for every row in the result set. You can see we call it on the currRow.
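For illustration, the per-row map is essentially a column-name-to-expression mapping built from the current row's values; a rough sketch (the names mirror the snippets above, but the body is an assumption, not the PR's code):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.Expression

// Rebuilt on every iteration: the expressions wrap the *values* of the current
// row, and those differ from row to row.
def createVariablesMapFromRow(row: Row): Map[String, Expression] = {
  row.schema.fieldNames.map { colName =>
    colName -> createExpressionFromValue(row.getAs[Any](colName))
  }.toMap
}
```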
Can we rebase first to include the already merged changes regarding the label checks, logical plans, etc.? And I'll review again afterwards?
Force-pushed from 446fc05 to 2e10f0b.
@davidm-db @miland-db Rebased, you can review again.
assert(statements === Seq(
  "statement1",
  "lbl1"
))
We don't have drop var statements here because they are dropped in handleLeaveStatement?
Is this the thing we talked about that will be properly resolved once proper execution and scopes are introduced?
Yes, that's right. In this case the variables are dropped immediately when the leave statement is encountered, instead of the usual behavior which is to return the dropVariable exec nodes from the iterator.
private var isResultCacheValid = false
private def cachedQueryResult(): Array[Row] = {
  if (!isResultCacheValid) {
    queryResult = query.buildDataFrame(session).collect()
Food for thought: does DataFrame have a mechanism to partially collect the data, so we don't collect all the results in memory? Since we are already using the caching concept, this would be easy to add to the logic of cachedQueryResult.
Quickly researching, we can do something like:
sliced_df = df.offset(starting_index).limit(ending_index - starting_index)
but there might be something better...
I wouldn't block the PR on this, but I think we definitely need to consider something like this for a follow-up.
cc: @cloud-fan @MaxGekk
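For reference, a rough Scala sketch of the offset/limit idea (a hypothetical follow-up, not part of this PR); note that each page re-runs the query, so it only makes sense for deterministic queries:

```scala
import org.apache.spark.sql.{DataFrame, Row}

// Pull the result in fixed-size pages instead of collect()-ing everything at once.
def collectInBatches(df: DataFrame, batchSize: Int): Iterator[Row] = {
  Iterator.from(0, batchSize)                                  // 0, batchSize, 2*batchSize, ...
    .map(start => df.offset(start).limit(batchSize).collect()) // one page per step
    .takeWhile(_.nonEmpty)                                     // stop at the first empty page
    .flatMap(_.iterator)
}
```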
That makes sense, currently the entire result is collected to the driver so it would be problematic if the result size is too large. We should definitely follow up on this
Let's see what Wenchen and Max have to say and maybe create a follow-up work item so we don't forget it.
There is a df.toLocalIterator(). Under the hood, it launches jobs for the RDD partitions one by one, so at most one partition's data is collected to the Spark driver at a time.
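In the shape of the code above, the switch looks roughly like this (a sketch; query.buildDataFrame(session) is taken from the existing snippet):

```scala
import scala.jdk.CollectionConverters._

// toLocalIterator() returns a java.util.Iterator[Row]; under the hood it runs one
// job per partition, so only one partition's rows are on the driver at a time.
val rows = query.buildDataFrame(session).toLocalIterator().asScala
rows.foreach { row =>
  // assign the FOR variables from `row`, then execute the loop body
}
```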
@cloud-fan @davidm-db
I refactored to try df.toLocalIterator() (minimal changes) and it seems to work properly. Could you take a look again?
I think we can simplify this a bit further; I commented on it here.
Otherwise, the idea is really cool. You already have my approval.
I've left comments, but in general the approach looks good to me!
Force-pushed from 2504fb7 to 829c6d1.
…ave/iterate/normal case
Force-pushed from db804f5 to 9d1cf29.
 */
private var interrupted: Boolean = false

private lazy val treeIterator: Iterator[CompoundStatementExec] =
@miland-db can you confirm whether we should return query or not here? Should we treat it the same as the IF/WHILE condition and return it in the iterator?
private var isResultCacheValid = false
private def cachedQueryResult(): util.Iterator[Row] = {
  if (!isResultCacheValid) {
    queryResult = query.buildDataFrame(session).toLocalIterator()
Let's use SparkPlan#executeToIterator(), which returns InternalRow, so that we can save the cost of data conversion in createExpressionFromValue.
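For context, a sketch of how that might look (it assumes going through the DataFrame's queryExecution; this is not the code that was merged):

```scala
import org.apache.spark.sql.catalyst.InternalRow

val df = query.buildDataFrame(session)
// executeToIterator() streams InternalRow directly from the physical plan, so no
// InternalRow -> Row conversion is paid before createExpressionFromValue runs.
val internalRows: Iterator[InternalRow] = df.queryExecution.executedPlan.executeToIterator()
```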
We can address the comments later, as they are kind of improvements. Let me merge it first. Thanks!
@cloud-fan Great, thanks. I will create follow-up tasks for the improvements you suggested.
What changes were proposed in this pull request?
In this PR, support for the FOR statement in SQL scripting is introduced. Examples:
Implementation notes:
As local variables for SQL scripting are currently a work in progress, session variables are used to simulate them.
When FOR begins executing, session variables are declared for each column in the result set, and optionally for the FOR loop variable if it is present ("row" in the example above).
On each iteration, these variables are overwritten with the values from the row currently being iterated.
The variables are dropped upon loop completion.
This means that if a session variable matching the name of a column in the result set already exists, the FOR statement will drop that variable after completion. If that variable were referenced after the FOR statement, the script would fail because the variable would no longer exist. This limitation is already present in the current iteration of SQL scripting and will be fixed once local variables are introduced. With local variables, the implementation of the FOR statement will also be much simpler.
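Since the original examples aren't reproduced here, a hypothetical illustration of the mechanism (the table name, the exact script text, and running it through spark.sql are assumptions for illustration, not taken from this PR):

```scala
val script =
  """BEGIN
    |  FOR row AS SELECT id, name FROM employees DO
    |    -- on every iteration the session variables `row`, `id` and `name`
    |    -- are overwritten with the values of the current row
    |    SELECT id, name;
    |  END FOR;
    |END""".stripMargin
spark.sql(script)
// Once the loop completes, `row`, `id` and `name` are dropped, including any
// session variables with those names that existed before the script ran.
```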
Grammar/parser changes:
- forStatement grammar rule
- visitForStatement rule visitor
- ForStatement logical operator

Why are the changes needed?
The FOR statement is a part of SQL scripting control flow logic.
Does this PR introduce any user-facing change?
No
How was this patch tested?
New tests are introduced to all three scripting test suites: SqlScriptingParserSuite, SqlScriptingExecutionNodeSuite and SqlScriptingInterpreterSuite.

Was this patch authored or co-authored using generative AI tooling?
No