DataSource — Pluggable Data Source

DataSource is…​FIXME

DataSource is created when…​FIXME

Tip
Read DataSource — Pluggable Data Sources (for Spark SQL’s batch structured queries).
Table 1. DataSource’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

providingClass

java.lang.Class that corresponds to the className (which can be either a fully-qualified class name or an alias of the data source)

sourceInfo

SourceInfo with the name, the schema, and optional partitioning columns of a source.

Used when:

Describing Name and Schema of Streaming Source — sourceSchema Internal Method

sourceSchema(): SourceInfo

sourceSchema…​FIXME

Note
sourceSchema is used exclusively when DataSource is requested for the SourceInfo.

Creating DataSource Instance

DataSource takes the following when created:

  • SparkSession

  • className, i.e. the fully-qualified class name or an alias of the data source

  • Paths (default: Nil, i.e. an empty collection)

  • Optional user-defined schema (default: None)

  • Names of the partition columns (default: empty)

  • Optional BucketSpec (default: None)

  • Configuration options (default: empty)

  • Optional CatalogTable (default: None)

DataSource initializes the internal registries and counters.
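The parameter list and its defaults can be pictured with a small, self-contained model. This is only a sketch: DataSourceModel, the *Stub types, and all parameter names are hypothetical stand-ins chosen for illustration, not Spark's actual declarations; only the order and the defaults follow the list above.

```scala
object DataSourceParamsSketch {
  // Hypothetical placeholder types standing in for Spark's own classes,
  // so the sketch compiles without Spark on the classpath.
  final class SparkSessionStub
  final case class SchemaStub(fieldNames: Seq[String])
  final case class BucketSpecStub(numBuckets: Int, bucketColumnNames: Seq[String])
  final case class CatalogTableStub(name: String)

  // Parameter names are assumptions; the defaults mirror the list above.
  final case class DataSourceModel(
      sparkSession: SparkSessionStub,
      className: String,                              // fully-qualified class name or alias
      paths: Seq[String] = Nil,                       // default: Nil
      userSpecifiedSchema: Option[SchemaStub] = None, // default: None
      partitionColumns: Seq[String] = Seq.empty,      // default: empty
      bucketSpec: Option[BucketSpecStub] = None,      // default: None
      options: Map[String, String] = Map.empty,       // default: empty
      catalogTable: Option[CatalogTableStub] = None)  // default: None

  // Only the two required arguments are given; the rest fall back to defaults.
  val minimal: DataSourceModel =
    DataSourceModel(new SparkSessionStub, "parquet")

  // Overriding selected defaults while keeping the others.
  val withOptions: DataSourceModel =
    minimal.copy(paths = Seq("/tmp/in"), options = Map("path" -> "/tmp/out"))
}
```

In this model, only the SparkSession and the className have to be supplied; every other argument is optional.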

Creating Streaming Source — createSource Method

createSource(metadataPath: String): Source

createSource…​FIXME

Note
createSource is used exclusively when MicroBatchExecution is requested to initialize the analyzed logical plan.

Creating Streaming Sink — createSink Method

createSink(outputMode: OutputMode): Sink

createSink creates a streaming sink for StreamSinkProvider or FileFormat data sources.

Internally, createSink creates a new instance of the providingClass and branches on its type (StreamSinkProvider, FileFormat, or anything else).

createSink throws an IllegalArgumentException when the path option is not specified for a FileFormat data source:

'path' is not specified

createSink throws an AnalysisException when the given OutputMode is different from Append for a FileFormat data source:

Data source [className] does not support [outputMode] output mode

createSink throws an UnsupportedOperationException for unsupported data source formats:

Data source [className] does not support streamed writing
Note
createSink is used exclusively when DataStreamWriter is requested to create and start a streaming query.
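The branching in createSink and its three error cases can be sketched with plain Scala pattern matching. This is a simplified model under stated assumptions, not Spark's code: every type below (OutputMode, Sink, StreamSinkProvider, FileFormat, AnalysisException, ProvidedSink, FileSink) is a hypothetical stand-in defined locally, the extra className, providingInstance, and options parameters are added so the function is self-contained, and the order of the two FileFormat checks is assumed to follow the order in which the errors are listed above.

```scala
object CreateSinkSketch {
  // Hypothetical stand-ins for Spark's types (not the real classes).
  sealed trait OutputMode
  case object Append extends OutputMode
  case object Complete extends OutputMode

  trait Sink
  final case class ProvidedSink(outputMode: OutputMode) extends Sink
  final case class FileSink(path: String, format: String) extends Sink

  trait StreamSinkProvider { def createSink(outputMode: OutputMode): Sink }
  trait FileFormat { def shortName: String }

  // Local stand-in for Spark's AnalysisException.
  final class AnalysisException(message: String) extends Exception(message)

  // Sample providers used to exercise each branch.
  object MemoryLike extends StreamSinkProvider {
    def createSink(outputMode: OutputMode): Sink = ProvidedSink(outputMode)
  }
  object ParquetLike extends FileFormat { val shortName = "parquet" }

  def createSink(
      className: String,
      providingInstance: Any,
      options: Map[String, String],
      outputMode: OutputMode): Sink =
    providingInstance match {
      // A StreamSinkProvider builds the sink itself.
      case provider: StreamSinkProvider =>
        provider.createSink(outputMode)

      // A FileFormat needs a path and supports Append only.
      case format: FileFormat =>
        val path = options.getOrElse(
          "path", throw new IllegalArgumentException("'path' is not specified"))
        if (outputMode != Append) {
          throw new AnalysisException(
            s"Data source $className does not support $outputMode output mode")
        }
        FileSink(path, format.shortName)

      // Anything else cannot be written to as a stream.
      case _ =>
        throw new UnsupportedOperationException(
          s"Data source $className does not support streamed writing")
    }
}
```

With these stand-ins, createSink("parquet", ParquetLike, Map("path" -> "/tmp/out"), Append) yields a FileSink, while dropping the path option, switching to Complete, or passing an unrelated object triggers the corresponding exception.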