Standard Functions
jaceklaskowski committed Apr 7, 2024
1 parent bfebf1a commit 43c3665
Showing 14 changed files with 201 additions and 74 deletions.
2 changes: 1 addition & 1 deletion docs/Column.md
@@ -244,7 +244,7 @@ over(window: WindowSpec): Column
`over` creates a _windowing column_ (_aka_ _analytic clause_) that allows executing an aggregate function over a [window](window-functions/WindowSpec.md) (i.e. a group of records that are in _some_ relation to the current record).
TIP: Read up on windowed aggregation in Spark SQL in spark-sql-functions-windows.md[Window Aggregate Functions].
TIP: Read up on windowed aggregation in Spark SQL in functions/windows-functions.md[Window Aggregate Functions].
[source, scala]
----
4 changes: 3 additions & 1 deletion docs/expressions/Generator.md
@@ -23,7 +23,9 @@ eval(
Evaluates the given [InternalRow](../InternalRow.md) to produce zero, one or more [InternalRow](../InternalRow.md)s

!!! note "Return Type"
`eval` is part of the [Expression](Expression.md#eval) abstraction and this `eval` enforces that `Generator`s produce a collection of [InternalRow](../InternalRow.md)s (not any other value as by non-generator expressions).
`eval` is part of the [Expression](Expression.md#eval) abstraction.

This `eval` enforces that `Generator`s produce a collection of [InternalRow](../InternalRow.md)s (unlike non-generator expressions, which produce single values).

## Implementations

7 changes: 7 additions & 0 deletions docs/expressions/MaxBy.md
@@ -0,0 +1,7 @@
---
title: MaxBy
---

# MaxBy Expression

`MaxBy` is a `MaxMinBy` aggregate function expression.
2 changes: 1 addition & 1 deletion docs/expressions/ParseToDate.md
@@ -1,6 +1,6 @@
# ParseToDate

`ParseToDate` is a [RuntimeReplaceable](RuntimeReplaceable.md) expression to represent [to_date](../spark-sql-functions-datetime.md#to_date) function (in logical query plans).
`ParseToDate` is a [RuntimeReplaceable](RuntimeReplaceable.md) expression to represent [to_date](../functions/datetime.md#to_date) function (in logical query plans).

As a `RuntimeReplaceable` expression, `ParseToDate` is replaced by [Logical Query Optimizer](../catalyst/Optimizer.md#ReplaceExpressions) with the [child](#child) expression:

2 changes: 1 addition & 1 deletion docs/expressions/ParseToTimestamp.md
@@ -1,6 +1,6 @@
# ParseToTimestamp

`ParseToTimestamp` is a [RuntimeReplaceable](RuntimeReplaceable.md) expression to represent [to_timestamp](../spark-sql-functions-datetime.md#to_timestamp) standard function (in logical query plans).
`ParseToTimestamp` is a [RuntimeReplaceable](RuntimeReplaceable.md) expression to represent [to_timestamp](../functions/datetime.md#to_timestamp) standard function (in logical query plans).

As a `RuntimeReplaceable` expression, `ParseToTimestamp` is replaced by [Logical Optimizer](../catalyst/Optimizer.md#ReplaceExpressions) with the [child](#child) expression:

8 changes: 6 additions & 2 deletions docs/expressions/UnixTimestamp.md
@@ -1,6 +1,6 @@
# UnixTimestamp

`UnixTimestamp` is a [binary](Expression.md#BinaryExpression) expression with [timezone](Expression.md#TimeZoneAwareExpression) support that represents [unix_timestamp](../spark-sql-functions-datetime.md#unix_timestamp) function (and indirectly [to_date](../spark-sql-functions-datetime.md#to_date) and [to_timestamp](../spark-sql-functions-datetime.md#to_timestamp)).
`UnixTimestamp` is a [binary](Expression.md#BinaryExpression) expression with [timezone](Expression.md#TimeZoneAwareExpression) support that represents [unix_timestamp](../functions/datetime.md#unix_timestamp) function (and indirectly [to_date](../functions/datetime.md#to_date) and [to_timestamp](../functions/datetime.md#to_timestamp)).

```text
import org.apache.spark.sql.functions.unix_timestamp
@@ -19,15 +19,19 @@ scala> c1.expr.isInstanceOf[UnixTimestamp]
res0: Boolean = true
```

<!---
## Review Me
NOTE: `UnixTimestamp` is `UnixTime` expression internally (as is `ToUnixTimestamp` expression).
[[inputTypes]][[dataType]]
`UnixTimestamp` supports `StringType`, [DateType](../types/DataType.md#DateType) and `TimestampType` as input types for a time expression and returns `LongType`.
```text
scala> c1.expr.eval()
res1: Any = 1493354303
```
[[formatter]]
`UnixTimestamp` uses `DateTimeUtils.newDateFormat` for date/time format (as Java's [java.text.DateFormat]({{ java.api }}/java/text/DateFormat.html)).
-->
78 changes: 77 additions & 1 deletion docs/functions/aggregate-functions.md
@@ -1,6 +1,45 @@
# Standard Aggregate Functions

## <span id="collect_set"> collect_set
## any { #any }

```scala
any(
e: Column): Column
```

`any` returns `true` if at least one value of `e` in a group is `true` (equivalent to [bool_or](#bool_or)).

## any_value { #any_value }

```scala
any_value(
e: Column): Column
any_value(
e: Column,
ignoreNulls: Column): Column
```

`any_value` returns an arbitrary (non-deterministic) value of `e` for a group of rows (skipping `null`s when `ignoreNulls` is enabled).
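
A quick `spark-shell` sketch (sample data made up for illustration; assumes Spark 3.5+, where `any_value` is available in the Scala API):

```scala
import org.apache.spark.sql.functions.{any_value, col, lit}

// Hypothetical (group, value) rows with a null to show ignoreNulls
val df = Seq(("a", null), ("a", "x"), ("b", "y")).toDF("g", "v")

// Some value of v per group; ignoreNulls = true skips nulls when picking
df.groupBy("g")
  .agg(any_value(col("v"), lit(true)))
  .show()
```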

## bool_and { #bool_and }

```scala
bool_and(
e: Column): Column
```

`bool_and` returns `true` if all values of `e` in a group are `true`.

## bool_or { #bool_or }

```scala
bool_or(
e: Column): Column
```

`bool_or` returns `true` if at least one value of `e` in a group is `true`.
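
A quick `spark-shell` sketch of `bool_and` and `bool_or` side by side (sample data made up for illustration; assumes Spark 3.5+):

```scala
import org.apache.spark.sql.functions.{bool_and, bool_or, col}

val checks = Seq((1, true), (1, false), (2, true)).toDF("id", "passed")

// id=1 yields bool_and=false and bool_or=true; id=2 yields true for both
checks
  .groupBy("id")
  .agg(bool_and(col("passed")), bool_or(col("passed")))
  .show()
```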

## collect_set { #collect_set }

```scala
collect_set(
@@ -13,6 +52,43 @@ collect_set(

In the end, `collect_set` wraps the [AggregateExpression](../expressions/AggregateExpression.md) up in a [Column](../Column.md).
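
A quick `spark-shell` sketch (sample data made up for illustration):

```scala
import org.apache.spark.sql.functions.collect_set

val events = Seq(("alice", "login"), ("alice", "login"), ("alice", "logout"))
  .toDF("user", "event")

// Duplicates are removed (unlike collect_list)
events.groupBy("user").agg(collect_set("event")).show(truncate = false)
```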

## count_if { #count_if }

```scala
count_if(
e: Column): Column
```

`count_if` returns the number of rows in a group for which the boolean column `e` is `true`.
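
A quick `spark-shell` sketch (sample data made up for illustration; assumes Spark 3.5+, where `count_if` is available in the Scala API):

```scala
import org.apache.spark.sql.functions.{count_if, col}

val nums = Seq(1, 2, 3, 4, 5).toDF("n")

// Counts the rows where the boolean column is true (here: the even values)
nums.agg(count_if(col("n") % 2 === 0)).show()
```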

## every { #every }

```scala
every(
e: Column): Column
```

`every` returns `true` if all values of `e` in a group are `true` (equivalent to [bool_and](#bool_and)).

## max_by { #max_by }

```scala
max_by(
e: Column,
ord: Column): Column
```

`max_by` creates a [MaxBy](../expressions/MaxBy.md) aggregate function that is then [wrapped into a Column](../functions/index.md#withAggregateFunction) (as an [AggregateExpression](../expressions/AggregateExpression.md)).
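
A quick `spark-shell` sketch (sample data made up for illustration) that picks the value of one column on the row with the maximum of another:

```scala
import org.apache.spark.sql.functions.{max_by, col}

val sales = Seq(("Alice", 10), ("Bob", 30), ("Cara", 20)).toDF("name", "amount")

// The name on the row with the highest amount (i.e. "Bob")
sales.agg(max_by(col("name"), col("amount"))).show()
```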

## some { #some }

```scala
some(
e: Column): Column
```

`some` returns `true` if at least one value of `e` in a group is `true` (equivalent to [any](#any)).

<!---
## Review Me
18 changes: 17 additions & 1 deletion docs/functions/collection-functions.md
@@ -1,6 +1,6 @@
# Standard Collection Functions

## <span id="filter"> filter
## filter { #filter }

```scala
filter(
@@ -15,6 +15,22 @@ filter(

In the end, `filter` wraps the `ArrayFilter` up in a [Column](../Column.md).
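
A quick `spark-shell` sketch (sample data made up for illustration):

```scala
import org.apache.spark.sql.functions.{filter, col}

val df = Seq(Seq(1, 2, 3, 4)).toDF("xs")

// Keep only the even array elements; the lambda operates on Column values
df.select(filter(col("xs"), x => x % 2 === 0)).show()
```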

## str_to_map { #str_to_map }

```scala
str_to_map(
text: Column): Column
str_to_map(
text: Column,
pairDelim: Column): Column
str_to_map(
text: Column,
pairDelim: Column,
keyValueDelim: Column): Column
```

`str_to_map` creates a map column by splitting `text` into key-value pairs using `pairDelim` and then splitting every pair into key and value using `keyValueDelim` (`,` and `:` by default, respectively).
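
A quick `spark-shell` sketch (sample data made up for illustration; assumes Spark 3.5+, where `str_to_map` is available in the Scala API):

```scala
import org.apache.spark.sql.functions.{str_to_map, col, lit}

val df = Seq("a:1,b:2,c:3").toDF("s")

// Split pairs on "," and each pair into key and value on ":"
df.select(str_to_map(col("s"), lit(","), lit(":"))).show(truncate = false)
```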

<!---
## Review Me
96 changes: 50 additions & 46 deletions docs/spark-sql-functions-datetime.md → docs/functions/datetime.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,48 @@
# Date and Time Functions

## to_date { #to_date }

```scala
to_date(
e: Column): Column
to_date(
e: Column,
fmt: String): Column
```

`to_date` converts the column into [DateType](../types/DataType.md#DateType) (by casting to `DateType`).

!!! note
`fmt` follows [the formatting styles](http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).

Internally, `to_date` creates a [Column](../Column.md) with [ParseToDate](../expressions/ParseToDate.md) expression (and `Literal` expression for `fmt`).

!!! tip
Use [ParseToDate](../expressions/ParseToDate.md) expression to use a column for the values of `fmt`.
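
A quick `spark-shell` sketch (sample data made up for illustration):

```scala
import org.apache.spark.sql.functions.{to_date, col}

val df = Seq("07/04/2024").toDF("s")

// The default format would not parse this string, so fmt is given explicitly
df.select(to_date(col("s"), "dd/MM/yyyy")).show()
```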

## to_timestamp { #to_timestamp }

```scala
to_timestamp(
s: Column): Column
to_timestamp(
s: Column,
fmt: String): Column
```

`to_timestamp` converts the column into [TimestampType](../types/DataType.md#TimestampType) (by casting to `TimestampType`).

!!! note
`fmt` follows [the formatting styles](http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).

Internally, `to_timestamp` creates a [Column](../Column.md) with [ParseToTimestamp](../expressions/ParseToTimestamp.md) expression (and `Literal` expression for `fmt`).

!!! tip
Use [ParseToTimestamp](../expressions/ParseToTimestamp.md) expression to use a column for the values of `fmt`.
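
A quick `spark-shell` sketch (sample data made up for illustration):

```scala
import org.apache.spark.sql.functions.{to_timestamp, col}

val df = Seq("07/04/2024 12:30:00").toDF("s")

// An explicit fmt to match the non-default date/time layout
df.select(to_timestamp(col("s"), "dd/MM/yyyy HH:mm:ss")).show(truncate = false)
```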

<!---
## Review Me
[[functions]]
.(Subset of) Standard Functions for Date and Time
[align="center",cols="1,2",width="100%",options="header"]
@@ -35,7 +78,7 @@
current_date(): Column
```
`current_date` function gives the current date as a [date](types/DataType.md#DateType) column.
`current_date` function gives the current date as a [date](../types/DataType.md#DateType) column.
```text
val df = spark.range(1).select(current_date)
@@ -51,7 +94,7 @@ root
|-- current_date(): date (nullable = false)
```
Internally, `current_date` creates a [Column](Column.md) with `CurrentDate` Catalyst leaf expression.
Internally, `current_date` creates a [Column](../Column.md) with `CurrentDate` Catalyst leaf expression.
```text
val c = current_date()
@@ -70,7 +113,7 @@ scala> println(cd.numberedTreeString)
date_format(dateExpr: Column, format: String): Column
```
Internally, `date_format` creates a [Column](Column.md) with `DateFormatClass` binary expression. `DateFormatClass` takes the expression from `dateExpr` column and `format`.
Internally, `date_format` creates a [Column](../Column.md) with `DateFormatClass` binary expression. `DateFormatClass` takes the expression from `dateExpr` column and `format`.
```text
val c = date_format($"date", "dd/MM/yyyy")
@@ -161,7 +204,7 @@ scala> spark.sql("SELECT unix_timestamp() as unix_timestamp").show
+--------------+
```
Internally, `unix_timestamp` creates a [Column](Column.md) with [UnixTimestamp](expressions/UnixTimestamp.md) binary expression (possibly with `CurrentTimestamp`).
Internally, `unix_timestamp` creates a [Column](../Column.md) with [UnixTimestamp](../expressions/UnixTimestamp.md) binary expression (possibly with `CurrentTimestamp`).
=== [[window]] Generating Time Windows -- `window` Function
@@ -208,7 +251,7 @@ scala> val timeColumn = window('time, "5 seconds")
timeColumn: org.apache.spark.sql.Column = timewindow(time, 5000000, 5000000, 0) AS `window`
----
`timeColumn` should be of [TimestampType](types/DataType.md#TimestampType), i.e. with [java.sql.Timestamp]({{ java.api }}/java/sql/Timestamp.html) values.
`timeColumn` should be of [TimestampType](../types/DataType.md#TimestampType), i.e. with [java.sql.Timestamp]({{ java.api }}/java/sql/Timestamp.html) values.
!!! tip
Use [java.sql.Timestamp.from]({{ java.api }}/java/sql/Timestamp.html#from-java.time.Instant-) or [java.sql.Timestamp.valueOf]({{ java.api }}/java/sql/Timestamp.html#valueOf-java.time.LocalDateTime-) factory methods to create `Timestamp` instances.
@@ -279,7 +322,7 @@ scala> sums.show
!!! TIP
Use `CalendarInterval` for valid window identifiers.
Internally, `window` creates a [Column](Column.md) (with [TimeWindow](expressions/TimeWindow.md) expression) available as `window` alias.
Internally, `window` creates a [Column](../Column.md) (with [TimeWindow](../expressions/TimeWindow.md) expression) available as `window` alias.
```text
// q is the query defined earlier
@@ -305,43 +348,4 @@ scala> println(timeColumn.expr.numberedTreeString)
NOTE: The example is borrowed from https://flink.apache.org/news/2015/12/04/Introducing-windows.html[Introducing Stream Windows in Apache Flink].
The example shows how to use `window` function to model a traffic sensor that counts every 15 seconds the number of vehicles passing a certain location.

## <span id="to_date"> to_date

```scala
to_date(
e: Column): Column
to_date(
e: Column,
fmt: String): Column
```

`to_date` converts the column into [DateType](types/DataType.md#DateType) (by casting to `DateType`).

!!! note
`fmt` follows [the formatting styles](http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).

Internally, `to_date` creates a [Column](Column.md) with [ParseToDate](expressions/ParseToDate.md) expression (and `Literal` expression for `fmt`).

!!! tip
Use [ParseToDate](expressions/ParseToDate.md) expression to use a column for the values of `fmt`.

## <span id="to_timestamp"> to_timestamp

```scala
to_timestamp(
s: Column): Column
to_timestamp(
s: Column,
fmt: String): Column
```

`to_timestamp` converts the column into [TimestampType](types/DataType.md#TimestampType) (by casting to `TimestampType`).

!!! note
`fmt` follows [the formatting styles](http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).

Internally, `to_timestamp` creates a [Column](Column.md) with [ParseToTimestamp](expressions/ParseToTimestamp.md) expression (and `Literal` expression for `fmt`).

!!! tip
Use [ParseToTimestamp](expressions/ParseToTimestamp.md) expression to use a column for the values of `fmt`.
-->
