Generator Expressions
jaceklaskowski committed Apr 7, 2024
1 parent 7f8ef9b commit bfebf1a
Showing 2 changed files with 58 additions and 72 deletions.
126 changes: 57 additions & 69 deletions docs/expressions/Generator.md
@@ -4,59 +4,76 @@ title: Generator

# Generator Expressions

`Generator` is a <<contract, contract>> for [Catalyst expressions](Expression.md) that can <<eval, produce>> zero or more rows given a single input row.
`Generator` is an [extension](#contract) of the [Expression](Expression.md) abstraction for [generator expressions](#implementations) that [can produce multiple rows for a single input row](#eval).

NOTE: `Generator` corresponds to SQL's sql/AstBuilder.md#withGenerate[LATERAL VIEW].
The execution of `Generator` is managed by [GenerateExec](../physical-operators/GenerateExec.md) unary physical operator.

[[dataType]]
`dataType` in `Generator` is simply an [ArrayType](../types/ArrayType.md) of <<elementSchema, elementSchema>>.
!!! note
    `Generator` corresponds to [LATERAL VIEW](../sql/AstBuilder.md#withGenerate) in SQL.

[[foldable]]
[[nullable]]
`Generator` is not Expression.md#foldable[foldable] and not Expression.md#nullable[nullable] by default.
## Contract (Subset)

[[supportCodegen]]
`Generator` supports [Java code generation](../whole-stage-code-generation/index.md) conditionally, i.e. only when a physical operator is not marked as [CodegenFallback](Expression.md#CodegenFallback).
### Interpreted Expression Evaluation { #eval }

[[terminate]]
`Generator` uses `terminate` to signal that there are no more rows to process. Implementations can run clean-up code and emit additional rows here.
```scala
eval(
input: InternalRow): TraversableOnce[InternalRow]
```

[source, scala]
----
terminate(): TraversableOnce[InternalRow] = Nil
----
Evaluates the given [InternalRow](../InternalRow.md) to produce zero, one or more [InternalRow](../InternalRow.md)s.

[[generator-implementations]]
.Generators
[width="100%",cols="1,2",options="header"]
|===
| Name
| Description
!!! note "Return Type"
    `eval` is part of the [Expression](Expression.md#eval) abstraction and this `eval` enforces that `Generator`s produce a collection of [InternalRow](../InternalRow.md)s (rather than a single value, as non-generator expressions do).
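
The one-row-in, many-rows-out shape of `eval` can be illustrated with a minimal, Spark-free sketch. `Row`, `SimpleGenerator` and `ExplodeLike` are hypothetical stand-ins introduced here for illustration only (they are not Spark's `InternalRow`, `Generator` or `Explode`), and `Iterable` is used in place of `TraversableOnce` for brevity.

```scala
// Simplified stand-ins for illustration only; none of these are Spark classes.
object GeneratorSketch {
  // Assumption: a row is modelled as a plain sequence of values.
  type Row = Seq[Any]

  // A generator maps a single input row to zero, one or more output rows.
  trait SimpleGenerator {
    def eval(input: Row): Iterable[Row]
  }

  // Hypothetical explode-like generator over the sequence stored at `ordinal`.
  final case class ExplodeLike(ordinal: Int) extends SimpleGenerator {
    def eval(input: Row): Iterable[Row] = input(ordinal) match {
      case null           => Nil                      // null collection: no rows
      case values: Seq[_] => values.map(v => Seq(v))  // one output row per element
      case other          => sys.error(s"Expected a sequence, got $other")
    }
  }

  def main(args: Array[String]): Unit = {
    val gen = ExplodeLike(ordinal = 0)
    println(gen.eval(Seq(Seq(1, 2, 3)))) // three single-column rows
    println(gen.eval(Seq(Seq.empty)))    // no rows
  }
}
```

A non-generator expression would return a single value from `eval`; the collection return type is what makes it possible to emit no rows at all for an input row.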

## Implementations

* `CollectionGenerator`
* `GeneratorOuter`
* `HiveGenericUDTF`
* `JsonTuple`
* `ReplicateRows`
* `SQLKeywords`
* `Stack`
* `UnevaluableGenerator`
* [UnresolvedGenerator](UnresolvedGenerator.md)
* `UserDefinedGenerator`

| [[ExplodeBase]] spark-sql-Expression-ExplodeBase.md[ExplodeBase]
|
## Data Type { #dataType }

| [[Explode]] spark-sql-Expression-ExplodeBase.md#Explode[Explode]
|
??? note "Expression"

| [[GeneratorOuter]] `GeneratorOuter`
|
    ```scala
    dataType: DataType
    ```

| [[HiveGenericUDTF]] `HiveGenericUDTF`
|
    `dataType` is part of the [Expression](Expression.md#dataType) abstraction.

| [[Inline]] spark-sql-Expression-Inline.md[Inline]
| Corresponds to `inline` and `inline_outer` functions.
`dataType` is an [ArrayType](../types/ArrayType.md) of the [elementSchema](#elementSchema).

| JsonTuple
|
## supportCodegen { #supportCodegen }

| [[PosExplode]] spark-sql-Expression-ExplodeBase.md#PosExplode[PosExplode]
|
```scala
supportCodegen: Boolean
```

`supportCodegen` is enabled (`true`) when this `Generator` is not a [CodegenFallback](../expressions/CodegenFallback.md).

`supportCodegen` is used when:

* `GenerateExec` physical operator is requested for [supportCodegen](../physical-operators/GenerateExec.md#supportCodegen)
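
The rule above amounts to a marker-trait check, which can be sketched without Spark as follows. `Expr`, `FallbackMarker`, `SimpleGenerator` and the two example objects are hypothetical names used for illustration (with `FallbackMarker` standing in for `CodegenFallback`). This is a sketch of the idea, not a copy of Spark's source.

```scala
// Simplified stand-ins for illustration only; none of these are Spark classes.
object SupportCodegenSketch {
  trait Expr

  // Stand-in for the CodegenFallback marker trait.
  trait FallbackMarker extends Expr

  trait SimpleGenerator extends Expr {
    // Code generation is supported unless the generator is marked as a fallback.
    def supportCodegen: Boolean = !this.isInstanceOf[FallbackMarker]
  }

  // Hypothetical generator that can be compiled to Java code.
  object CompiledGen extends SimpleGenerator

  // Hypothetical generator that falls back to interpreted evaluation.
  object InterpretedGen extends SimpleGenerator with FallbackMarker

  def main(args: Array[String]): Unit = {
    println(CompiledGen.supportCodegen)
    println(InterpretedGen.supportCodegen)
  }
}
```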

<!---
## Review Me
| `Stack`
|
[[terminate]]
`Generator` uses `terminate` to signal that there are no more rows to process. Implementations can run clean-up code and emit additional rows here.
[source, scala]
----
terminate(): TraversableOnce[InternalRow] = Nil
----
|===
| [[UnresolvedGenerator]] spark-sql-Expression-UnresolvedGenerator.md[UnresolvedGenerator]
a| Represents an unresolved <<Generator, generator>>.
@@ -70,9 +87,6 @@ AS? col1, col2, ...
```
`UnresolvedGenerator` is [resolved](../Analyzer.md#ResolveFunctions) to `Generator` by [ResolveFunctions](../Analyzer.md#ResolveFunctions) logical resolution rule.

| [[UserDefinedGenerator]] `UserDefinedGenerator`
| Used exclusively in the deprecated `explode` operator
|===
[[lateral-view]]
@@ -90,8 +104,7 @@ org.apache.spark.sql.AnalysisException: Only one generator allowed per select cl
If you want to have more than one generator in a structured query you should use `LATERAL VIEW` which is supported in SQL only, e.g.
[source, scala]
----
```text
val arrayTuple = (Array(1,2,3), Array("a","b","c"))
val ncs = Seq(arrayTuple).toDF("ns", "cs")
@@ -124,31 +137,6 @@ scala> sql(q).show
| 3| b|
| 3| c|
+---+---+
----
```
====

=== [[contract]] Generator Contract

[source, scala]
----
package org.apache.spark.sql.catalyst.expressions

trait Generator extends Expression {
// only required methods that have no implementation
def elementSchema: StructType
def eval(input: InternalRow): TraversableOnce[InternalRow]
}
----

.(Subset of) Generator Contract
[cols="1,2",options="header",width="100%"]
|===
| Method
| Description

| [[elementSchema]] `elementSchema`
| [Schema](../types/StructType.md) of the elements to be generated

| [[eval]] `eval`
|
|===
-->
4 changes: 1 addition & 3 deletions docs/physical-operators/GenerateExec.md
@@ -4,12 +4,10 @@ title: GenerateExec

# GenerateExec Unary Physical Operator

`GenerateExec` is a [unary physical operator](UnaryExecNode.md) with [CodegenSupport](CodegenSupport.md).
`GenerateExec` is a [unary physical operator](UnaryExecNode.md) to manage execution of a [Generator](#generator) expression.

`GenerateExec` represents [Generate](../logical-operators/Generate.md) unary logical operator at execution time.

`GenerateExec` is an execution environment for the [Generator](#generator) expression.

When [executed](#doExecute), `GenerateExec` [executes](../expressions/Generator.md#eval) (aka _evaluates_) the [Generator](#boundGenerator) expression on every row in an RDD partition.

![GenerateExec's Execution -- `doExecute` Method](../images/spark-sql-GenerateExec-doExecute.png)
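
Evaluating a generator on every row of a partition is essentially a `flatMap` over the partition's row iterator. The sketch below assumes a simplified `Row` type and a hypothetical `generateOverPartition` helper. It deliberately leaves out everything else the physical operator may do (for example, combining generated rows with columns of the input row), so it should be read as an illustration of the idea only.

```scala
// Simplified stand-ins for illustration only; none of these are Spark classes.
object GenerateSketch {
  // Assumption: a row is modelled as a plain sequence of values.
  type Row = Seq[Any]

  // Applies `generate` to every row of one partition and lazily
  // concatenates the generated rows into a single iterator.
  def generateOverPartition(rows: Iterator[Row])(generate: Row => Iterable[Row]): Iterator[Row] =
    rows.flatMap(generate)

  def main(args: Array[String]): Unit = {
    val partition: Iterator[Row] = Iterator(Seq(1), Seq(2))
    // Hypothetical generator that emits every input row twice.
    val generated = generateOverPartition(partition)(row => Seq(row, row))
    generated.foreach(println)
  }
}
```

Because the result is an `Iterator`, rows are generated on demand as the downstream consumer pulls them, rather than being materialized for the whole partition up front.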