[SPARK-45794][SS] Introduce state metadata source to query the streaming state metadata information #43660
Conversation
@HeartSaVioR PTAL, thanks!
Only minor comments. Maybe the important part is how/where to document this, probably along with the state data source once it is merged. We could file a separate ticket for it.
```scala
override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = {
  () => {
    assert(options.containsKey("path"), "Must specify checkpoint path to read state metadata")
```
We should throw an IllegalArgumentException or a proper error class. Let's do the former, and we can apply error classes to both the state data source and the state metadata data source together.
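A minimal sketch of the suggested validation, based only on the snippet above; the object and helper names are hypothetical and not part of the PR, and the option key and message are taken from the assert being replaced.

```scala
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hedged sketch: fail with IllegalArgumentException instead of assert when
// the required "path" option is missing. Helper/object names are illustrative.
object StateMetadataOptions {
  def requireCheckpointPath(options: CaseInsensitiveStringMap): String = {
    if (!options.containsKey("path")) {
      throw new IllegalArgumentException(
        "Must specify checkpoint path to read state metadata")
    }
    options.get("path")
  }
}
```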
Fixed.
.add("operatorName", StringType) | ||
.add("stateStoreName", StringType) | ||
.add("numPartitions", IntegerType) | ||
.add("numColsPrefixKey", IntegerType) |
Can we make this a metadata column? Probably add an underscore prefix as well: _numColsPrefixKey. This is purely an internal value and most users won't have the context for it; we never require users to know this.
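A hedged sketch of one way to expose _numColsPrefixKey through the DSv2 SupportsMetadataColumns mix-in; the trait and object names below are illustrative and are not the ones used in the PR, and the comment text is an assumption.

```scala
import org.apache.spark.sql.connector.catalog.{MetadataColumn, SupportsMetadataColumns}
import org.apache.spark.sql.types.{DataType, IntegerType}

// Hedged sketch: a mix-in a table implementation could extend so that
// _numColsPrefixKey is reported as a metadata column rather than a regular one.
trait NumColsPrefixKeyMetadata extends SupportsMetadataColumns {
  private object NumColsPrefixKeyColumn extends MetadataColumn {
    override def name: String = "_numColsPrefixKey"
    override def dataType: DataType = IntegerType
    override def comment: String = "Number of prefix-key columns of the state store (internal)"
  }

  override def metadataColumns(): Array[MetadataColumn] = Array(NumColsPrefixKeyColumn)
}
```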
Makes sense, I made it a metadata column.
+1
Filed a JIRA ticket https://issues.apache.org/jira/browse/SPARK-45833 for addressing documentation on both the state data source and the state metadata source.
CI only failed in Run / Build modules: pyspark-mllib, pyspark-ml, pyspark-ml-connect, which is irrelevant.
Thanks! Merging to master.
What changes were proposed in this pull request?
Introduce a new data source so that users can query the metadata of each state store of a streaming query. The result schema includes columns such as operatorName, stateStoreName, and numPartitions (see the schema snippet in the review above); numColsPrefixKey is exposed as the _numColsPrefixKey metadata column per the review discussion.
To use this source, specify the source format and the checkpoint path, then load the DataFrame:
df = spark.read.format("state-metadata").load("/checkpointPath")
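A hedged Scala equivalent of the PySpark line above, as a minimal usage sketch. The checkpoint path is a placeholder, and the assumption that _numColsPrefixKey must be selected explicitly follows the metadata-column discussion in the review rather than anything stated in this description.

```scala
import org.apache.spark.sql.SparkSession

// Minimal usage sketch; "/checkpointPath" stands in for a real streaming
// query checkpoint location.
val spark = SparkSession.builder().appName("state-metadata-example").getOrCreate()

val stateMetadata = spark.read
  .format("state-metadata")
  .load("/checkpointPath")

// Columns from the schema snippet above; _numColsPrefixKey is a metadata
// column (per the review), so it is selected explicitly here.
stateMetadata
  .select("operatorName", "stateStoreName", "numPartitions", "_numColsPrefixKey")
  .show(false)
```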
Why are the changes needed?
To improve debuggability. It also facilitates querying the state store data source introduced in SPARK-45511 by exposing the operator id, batch id, and state store name; a sketch of that workflow follows.
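A hedged sketch of that intended workflow. The "statestore" format name and its operatorId/storeName options belong to the separate SPARK-45511 work and are assumptions here, as is the operatorId column name; the spark session is the one created in the previous snippet.

```scala
// Hedged sketch only: the "statestore" format and its options come from the
// separate SPARK-45511 work and may differ from what is finally merged.
// Reuses the `spark` session from the previous snippet.
val meta = spark.read.format("state-metadata").load("/checkpointPath")

// Pick one state store from the metadata and read its contents.
val firstStore = meta.select("operatorId", "stateStoreName").head()

val stateRows = spark.read
  .format("statestore")
  .option("operatorId", firstStore.get(0).toString)
  .option("storeName", firstStore.get(1).toString)
  .load("/checkpointPath")

stateRows.show(false)
```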
Does this PR introduce any user-facing change?
Yes, this is a new source exposed to users.
How was this patch tested?
Added tests to verify the output of the state metadata source.
Was this patch authored or co-authored using generative AI tooling?
No.