diff --git a/docs/connector/KeyGroupedPartitioning.md b/docs/connector/KeyGroupedPartitioning.md
new file mode 100644
index 0000000000..a0f7c8e028
--- /dev/null
+++ b/docs/connector/KeyGroupedPartitioning.md
@@ -0,0 +1,15 @@
+# KeyGroupedPartitioning
+
+`KeyGroupedPartitioning` is a [Partitioning](Partitioning.md) where rows are split across partitions based on the [partition transform expressions](#keys).
+
+`KeyGroupedPartitioning` is a key part of [Storage-Partitioned Joins](../storage-partitioned-joins/index.md).
+
+!!! note
+    Not used in any of the [built-in Spark SQL connectors](../connectors/index.md) yet.
+
+## Creating Instance
+
+`KeyGroupedPartitioning` takes the following to be created:
+
+* Partition transform [expression](../expressions/Expression.md)s
+* Number of partitions
diff --git a/docs/connector/Partitioning.md b/docs/connector/Partitioning.md
index a558dd0490..7f289abddd 100644
--- a/docs/connector/Partitioning.md
+++ b/docs/connector/Partitioning.md
@@ -4,14 +4,14 @@ title: Partitioning
 
 # Partitioning
 
-`Partitioning` is an [abstraction](#contract) of [output data partitioning requirements](#implementations) (_data distribution_) of a Spark SQL connector.
+`Partitioning` is an [abstraction](#contract) of [output data partitioning requirements](#implementations) (_data distribution_) of a [Spark SQL connector](index.md).
 
 !!! note
     This `Partitioning` interface for Spark SQL developers mimics the internal Catalyst [Partitioning](../physical-operators/Partitioning.md) that is converted into with the help of [DataSourcePartitioning](../physical-operators/Partitioning.md#DataSourcePartitioning).
 
 ## Contract
 
-###  Number of Partitions
+### Number of Partitions { #numPartitions }
 
 ```java
 int numPartitions()
@@ -21,7 +21,7 @@ Used when:
 
 * [DataSourcePartitioning](../physical-operators/Partitioning.md#DataSourcePartitioning) is requested for the [number of partitions](../physical-operators/Partitioning.md#numPartitions)
 
-###  Satisfying Distribution
+### Satisfying Distribution { #satisfy }
 
 ```java
 boolean satisfy(
@@ -34,5 +34,5 @@ Used when:
 
 ## Implementations
 
-* `KeyGroupedPartitioning`
+* [KeyGroupedPartitioning](KeyGroupedPartitioning.md)
 * `UnknownPartitioning`
diff --git a/docs/storage-partitioned-joins/.pages b/docs/storage-partitioned-joins/.pages
new file mode 100644
index 0000000000..3d3619dc26
--- /dev/null
+++ b/docs/storage-partitioned-joins/.pages
@@ -0,0 +1,4 @@
+title: Storage-Partitioned Joins
+nav:
+  - index.md
+  - ...
diff --git a/docs/storage-partitioned-joins/index.md b/docs/storage-partitioned-joins/index.md
new file mode 100644
index 0000000000..17c575fdd4
--- /dev/null
+++ b/docs/storage-partitioned-joins/index.md
@@ -0,0 +1,12 @@
+# Storage-Partitioned Joins
+
+**Storage-Partitioned Joins** (_SPJ_) are a new type of [join](../joins.md) in Spark SQL that uses the existing storage layout of a partitioned data source to avoid expensive shuffles (similarly to [Bucketing](../bucketing/index.md)).
+
+!!! note
+    The Storage-Partitioned Joins feature was added in Apache Spark 3.3.0 ([\[SPARK-37375\] Umbrella: Storage Partitioned Join (SPJ)]({{ spark.jira }}/SPARK-37375)).
+
+Storage-Partitioned Join is meant mainly, if not exclusively, for [Spark SQL connectors](../connector/index.md) (_v2 data sources_).
+
+Storage-Partitioned Join was proposed in this [SPIP](https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE).
+
+Storage-Partitioned Join uses [KeyGroupedPartitioning](../connector/KeyGroupedPartitioning.md) to determine partitions.
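For illustration, here is a minimal sketch of how a connector could report a `KeyGroupedPartitioning`, assuming the Spark 3.3+ DataSource V2 API in which a `Scan` mixes in `SupportsReportPartitioning`. The `RegionGroupedScan` class and the `region` column are hypothetical; the example only demonstrates the two constructor arguments listed under Creating Instance (partition transform expressions and number of partitions).

```java
import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.read.SupportsReportPartitioning;
import org.apache.spark.sql.connector.read.partitioning.KeyGroupedPartitioning;
import org.apache.spark.sql.connector.read.partitioning.Partitioning;
import org.apache.spark.sql.types.StructType;

// Hypothetical Scan over data that is already laid out by the `region` column.
class RegionGroupedScan implements SupportsReportPartitioning {

  private final StructType schema;
  private final int numPartitions;

  RegionGroupedScan(StructType schema, int numPartitions) {
    this.schema = schema;
    this.numPartitions = numPartitions;
  }

  @Override
  public StructType readSchema() {
    return schema;
  }

  @Override
  public Partitioning outputPartitioning() {
    // The two arguments KeyGroupedPartitioning is created with:
    // the partition transform expressions (keys) and the number of partitions.
    Expression[] keys = new Expression[] { Expressions.identity("region") };
    return new KeyGroupedPartitioning(keys, numPartitions);
  }
}
```

A real connector would also have to plan its input partitions along the same keys, so that the reported layout matches the data actually produced.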
diff --git a/mkdocs.yml b/mkdocs.yml
index 3480cbc6e7..f123de25b5 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -163,6 +163,7 @@ nav:
   - ... | bloom-filter-join/**.md
   - ... | bucketing/**.md
   - ... | cache-serialization/**.md
+  - ... | storage-partitioned-joins/**.md
   - Catalog Plugin API:
     - connector/catalog/index.md
     - CatalogExtension: connector/catalog/CatalogExtension.md
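As a usage sketch from the query side, assuming the `spark.sql.sources.v2.bucketing.enabled` configuration property introduced under SPARK-37375 and two hypothetical V2 tables (`demo_catalog.db.orders` and `demo_catalog.db.customers`) whose scans report compatible `KeyGroupedPartitioning`s over `customer_id`:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StoragePartitionedJoinDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("storage-partitioned-join-demo")
        // Storage-partitioned joins are only considered when V2 bucketing is enabled.
        .config("spark.sql.sources.v2.bucketing.enabled", "true")
        .getOrCreate();

    // Hypothetical V2 tables, both partitioned (e.g. by an identity transform) on customer_id.
    Dataset<Row> orders = spark.table("demo_catalog.db.orders");
    Dataset<Row> customers = spark.table("demo_catalog.db.customers");

    // If both scans report compatible KeyGroupedPartitionings over customer_id,
    // the physical plan should show no Exchange (shuffle) for this join.
    orders.join(customers, "customer_id").explain();

    spark.stop();
  }
}
```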