[SPARK-49378][DOCS][SS] Break apart the Structured Streaming Programming Guide

### What changes were proposed in this pull request?

These changes break the Structured Streaming Programming Guide into smaller sub-pages **without changing any content**. You can see a preview of it [here](https://nr-spark-site.vercel.app/).

I split the guide into one page per `h1` heading; within each page, the sub-sections in the left menu correspond to `h2` headings. The SS programming guide will now resemble the SQL programming guide and the MLlib programming guide.
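
As a rough illustration of this split (not the procedure actually used for the PR, which the description does not specify), the following Python sketch walks a Markdown guide, starts a page at each `h1`, and records each `h2` as a sub-item. The function names, the `slugify` rule, and the `streaming` directory argument are assumptions chosen to match the anchors visible in `docs/_data/menu-streaming.yaml`; the real site generator may derive heading ids differently.

```python
import re


def slugify(heading):
    """Approximate the anchor ids seen in menu-streaming.yaml (an assumption)."""
    kept = re.sub(r"[^a-z0-9 -]", "", heading.lower())
    return kept.strip().replace(" ", "-")


def build_menu(markdown, page_dir="streaming"):
    """Turn h1 headings into pages and h2 headings into their subitems."""
    menu = []
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith("```"):
            in_fence = not in_fence  # skip '#' comments inside code fences
            continue
        if in_fence:
            continue
        if line.startswith("# "):
            title = line[2:].strip()
            url = f"{page_dir}/{slugify(title)}.html"
            menu.append({"text": title, "url": url, "subitems": []})
        elif line.startswith("## ") and menu:
            title = line[3:].strip()
            url = f"{menu[-1]['url']}#{slugify(title)}"
            menu[-1]["subitems"].append({"text": title, "url": url})
    return menu
```

Each entry has the same `text` / `url` / `subitems` shape as the menu data file in this commit, so the output could in principle be dumped to YAML. Note that the hand-written menu uses `index.html` for the overview page, which this sketch does not special-case.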

Additionally, to avoid cluttering the top-level namespace (there are already dozens of `sql-*` files for the SQL reference), all streaming docs are nested one directory deeper, under `/streaming/`. This has the side effect of breaking links from our `_layouts`, which assume a flat top-level namespace. To fix this, URLs in the global layout files no longer assume that the current page sits at the site root: most are prefixed with a computed `rel_path_to_root`, and a few script and image URLs use absolute paths.
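
To make the path handling concrete, here is a minimal Python sketch of the prefix computation that the Liquid at the top of `docs/_layouts/global.html` (in the diff below) performs: one `../` per non-empty segment of the page's directory. The Python function names and the sample directories are illustrative assumptions, not part of the patch.

```python
def rel_path_to_root(page_dir):
    """Mirror the Liquid logic: one '../' per non-empty segment of page_dir."""
    segments = [segment for segment in page_dir.split("/") if segment != ""]
    return "../" * len(segments)


def resolve(page_dir, site_relative_url):
    """Prefix a site-relative URL so it works from a page in page_dir."""
    return rel_path_to_root(page_dir) + site_relative_url


if __name__ == "__main__":
    # A top-level page needs no prefix; a page under /streaming/ needs one level.
    print(resolve("/", "css/custom.css"))
    print(resolve("/streaming/", "css/custom.css"))
```

Because the prefix is relative, the same layout should keep working wherever the docs are hosted under a version-specific sub-path, which is not true of a URL that starts with `/`.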

This move to `/streaming/` means that bookmarks of `https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html` will no longer land directly on the programming guide content. In anticipation of this, I have kept a page at every existing URL, each linking to the content's new location. This includes the new state data source and the Kafka integration guide.

In the future, we'll be able to break the programming guide apart further quite easily, and in parallel. This PR does all of the plumbing to make that work.

![image](https://github.com/user-attachments/assets/3eca87d4-9fb7-453c-a74a-20bd5c504d87)

Fixing the oddly sized left navigation bar for our menus is left as future work.

### Why are the changes needed?

One of the major hurdles that users have with Structured Streaming is that our guide is exceptionally long; it feels insurmountable, especially compared to the documentation of other engines like Flink, which is split into many sub-pages.

Google also has a hard time indexing the single large page: if you search for "[structured streaming output mode](https://www.google.com/search?q=structured+streaming+output+mode)" and click the link to our programming guide, nothing happens. You aren't taken to the actual content, since Google has trouble linking to specific headings within a page.

### Does this PR introduce _any_ user-facing change?

The structure of the website has changed for Structured Streaming-related pages. See the earlier parts of the PR description for the specific changes.

However, **no** content is changed. This should make reviewing the changes much easier.

### How was this patch tested?

I used automated tools (e.g. [Lychee](https://github.com/lycheeverse/lychee)) and manual verification (i.e. clicking on every link) to make sure that I didn't break any links. This isn't foolproof, though.
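
For readers who want a feel for what such a check involves, below is a minimal Python sketch of a local link checker. It is not Lychee and not the tooling used for this PR; `find_broken_links` is a hypothetical helper, and it assumes a built site directory of `.html` files, checking only links that point inside that directory.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urldefrag, urlparse


class _HrefCollector(HTMLParser):
    """Collects the href of every <a> tag in one HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_broken_links(site_root):
    """Return (page, href) pairs whose local target file does not exist."""
    root = Path(site_root).resolve()
    broken = []
    for page in sorted(root.rglob("*.html")):
        collector = _HrefCollector()
        collector.feed(page.read_text(encoding="utf-8"))
        for href in collector.hrefs:
            parsed = urlparse(href)
            if parsed.scheme or parsed.netloc:
                continue  # external link: out of scope for this sketch
            path, _fragment = urldefrag(href)
            if not path:
                continue  # same-page anchor such as "#section"
            if path.startswith("/"):
                target = root / path.lstrip("/")
            else:
                target = page.parent / path
            if not target.resolve().is_file():
                broken.append((page.relative_to(root).as_posix(), href))
    return broken
```

The sketch does not verify that a `#fragment` matches a heading id on the target page, which is exactly the kind of breakage a page split can introduce, so manual spot checks remain useful.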

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47864 from neilramaswamy/nr/streaming-guide-breakapart.

Lead-authored-by: Neil Ramaswamy <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
neilramaswamy and yaooqinn committed Aug 30, 2024
1 parent 53c1f31 commit 493ca98
Showing 19 changed files with 5,728 additions and 5,506 deletions.
57 changes: 57 additions & 0 deletions docs/_data/menu-streaming.yaml
@@ -0,0 +1,57 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

- text: Overview
  url: streaming/index.html
- text: Getting Started
  url: streaming/getting-started.html
  subitems:
    - text: Quick Example
      url: streaming/getting-started.html#quick-example
    - text: Programming Model
      url: streaming/getting-started.html#programming-model
- text: APIs on DataFrames and Datasets
  url: streaming/apis-on-dataframes-and-datasets.html
  subitems:
    - text: Creating Streaming DataFrames and Streaming Datasets
      url: streaming/apis-on-dataframes-and-datasets.html#creating-streaming-dataframes-and-streaming-datasets
    - text: Operations on Streaming DataFrames/Datasets
      url: streaming/apis-on-dataframes-and-datasets.html#operations-on-streaming-dataframesdatasets
    - text: Starting Streaming Queries
      url: streaming/apis-on-dataframes-and-datasets.html#starting-streaming-queries
    - text: Managing Streaming Queries
      url: streaming/apis-on-dataframes-and-datasets.html#managing-streaming-queries
    - text: Monitoring Streaming Queries
      url: streaming/apis-on-dataframes-and-datasets.html#monitoring-streaming-queries
    - text: Recovering from Failures with Checkpointing
      url: streaming/apis-on-dataframes-and-datasets.html#recovering-from-failures-with-checkpointing
    - text: Recovery Semantics after Changes in a Streaming Query
      url: streaming/apis-on-dataframes-and-datasets.html#recovery-semantics-after-changes-in-a-streaming-query
- text: Performance Tips
  url: streaming/performance-tips.html
  subitems:
    - text: Asynchronous Progress Tracking
      url: streaming/performance-tips.html#asynchronous-progress-tracking
    - text: Continuous Processing
      url: streaming/performance-tips.html#continuous-processing
- text: Additional Information
  url: streaming/additional-information.html
  subitems:
    - text: Miscellaneous Notes
      url: streaming/additional-information.html#miscellaneous-notes
    - text: Related Resources
      url: streaming/additional-information.html#related-resources
    - text: Migration Guide
      url: streaming/additional-information.html#migration-guide
22 changes: 22 additions & 0 deletions docs/_includes/nav-left-wrapper-streaming.html
@@ -0,0 +1,22 @@
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
<div class="left-menu-wrapper">
<div class="left-menu">
<h3><a href="{{ rel_path_to_root }}streaming/index.html">Structured Streaming Programming Guide</a></h3>
{% include nav-left.html nav=include.nav-streaming %}
</div>
</div>
2 changes: 1 addition & 1 deletion docs/_includes/nav-left.html
@@ -2,7 +2,7 @@
<ul>
{% for item in include.nav %}
<li>
-<a href="{{ item.url }}">
+<a href="{{ rel_path_to_root }}{{ item.url }}">
{% if navurl contains item.url %}
<b>{{ item.text }}</b>
{% else %}
93 changes: 50 additions & 43 deletions docs/_layouts/global.html
@@ -1,3 +1,9 @@
+{% assign current_page_segments = page.dir | split: "/" | where_exp: "element","element != ''" %}
+{% assign rel_path_to_root = "" %}
+{% for i in (1..current_page_segments.size) %}
+{% assign rel_path_to_root = rel_path_to_root | append: "../" %}
+{% endfor %}
+
<!DOCTYPE html>
<html class="no-js">
<head>
@@ -21,12 +27,12 @@
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=DM+Sans:ital,wght@0,400;0,500;0,700;1,400;1,500;1,700&Courier+Prime:wght@400;700&display=swap" rel="stylesheet">
-<link href="css/custom.css" rel="stylesheet">
-<script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
+<link href="{{ rel_path_to_root }}css/custom.css" rel="stylesheet">
+<script src="/js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>

-<link rel="stylesheet" href="css/pygments-default.css">
+<link rel="stylesheet" href="{{ rel_path_to_root }}css/pygments-default.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/docsearch.js@2/dist/cdn/docsearch.min.css" />
-<link rel="stylesheet" href="css/docsearch.css">
+<link rel="stylesheet" href="{{ rel_path_to_root }}css/docsearch.css">

{% production %}
<!-- Matomo -->
@@ -51,8 +57,8 @@
<body class="global">
<!-- This code is taken from http://twitter.github.com/bootstrap/examples/hero.html -->
<nav class="navbar navbar-expand-lg navbar-dark p-0 px-4 fixed-top" style="background: #1d6890;" id="topbar">
-<div class="navbar-brand"><a href="index.html">
-<img src="img/spark-logo-rev.svg" width="141" height="72"/></a><span class="version">{{site.SPARK_VERSION_SHORT}}</span>
+<div class="navbar-brand"><a href="{{ rel_path_to_root }}index.html">
+<img src="/img/spark-logo-rev.svg" width="141" height="72"/></a><span class="version">{{site.SPARK_VERSION_SHORT}}</span>
</div>
<button class="navbar-toggler" type="button" data-toggle="collapse"
data-target="#navbarCollapse" aria-controls="navbarCollapse"
@@ -61,58 +67,58 @@
</button>
<div class="collapse navbar-collapse" id="navbarCollapse">
<ul class="navbar-nav me-auto">
-<li class="nav-item"><a href="index.html" class="nav-link">Overview</a></li>
+<li class="nav-item"><a href="{{ rel_path_to_root }}index.html" class="nav-link">Overview</a></li>

<li class="nav-item dropdown">
<a href="#" class="nav-link dropdown-toggle" id="navbarQuickStart" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">Programming Guides</a>
<div class="dropdown-menu" aria-labelledby="navbarQuickStart">
-<a class="dropdown-item" href="quick-start.html">Quick Start</a>
-<a class="dropdown-item" href="rdd-programming-guide.html">RDDs, Accumulators, Broadcasts Vars</a>
-<a class="dropdown-item" href="sql-programming-guide.html">SQL, DataFrames, and Datasets</a>
-<a class="dropdown-item" href="structured-streaming-programming-guide.html">Structured Streaming</a>
-<a class="dropdown-item" href="streaming-programming-guide.html">Spark Streaming (DStreams)</a>
-<a class="dropdown-item" href="ml-guide.html">MLlib (Machine Learning)</a>
-<a class="dropdown-item" href="graphx-programming-guide.html">GraphX (Graph Processing)</a>
-<a class="dropdown-item" href="sparkr.html">SparkR (R on Spark)</a>
-<a class="dropdown-item" href="api/python/getting_started/index.html">PySpark (Python on Spark)</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}quick-start.html">Quick Start</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}rdd-programming-guide.html">RDDs, Accumulators, Broadcasts Vars</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}sql-programming-guide.html">SQL, DataFrames, and Datasets</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}streaming/index.html">Structured Streaming</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}streaming-programming-guide.html">Spark Streaming (DStreams)</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}ml-guide.html">MLlib (Machine Learning)</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}graphx-programming-guide.html">GraphX (Graph Processing)</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}sparkr.html">SparkR (R on Spark)</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}api/python/getting_started/index.html">PySpark (Python on Spark)</a>
</div>
</li>

<li class="nav-item dropdown">
<a href="#" class="nav-link dropdown-toggle" id="navbarAPIDocs" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">API Docs</a>
<div class="dropdown-menu" aria-labelledby="navbarAPIDocs">
-<a class="dropdown-item" href="api/python/index.html">Python</a>
-<a class="dropdown-item" href="api/scala/org/apache/spark/index.html">Scala</a>
-<a class="dropdown-item" href="api/java/index.html">Java</a>
-<a class="dropdown-item" href="api/R/index.html">R</a>
-<a class="dropdown-item" href="api/sql/index.html">SQL, Built-in Functions</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}api/python/index.html">Python</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}api/scala/org/apache/spark/index.html">Scala</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}api/java/index.html">Java</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}api/R/index.html">R</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}api/sql/index.html">SQL, Built-in Functions</a>
</div>
</li>

<li class="nav-item dropdown">
<a href="#" class="nav-link dropdown-toggle" id="navbarDeploying" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">Deploying</a>
<div class="dropdown-menu" aria-labelledby="navbarDeploying">
-<a class="dropdown-item" href="cluster-overview.html">Overview</a>
-<a class="dropdown-item" href="submitting-applications.html">Submitting Applications</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}cluster-overview.html">Overview</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}submitting-applications.html">Submitting Applications</a>
<div class="dropdown-divider"></div>
-<a class="dropdown-item" href="spark-standalone.html">Spark Standalone</a>
-<a class="dropdown-item" href="running-on-yarn.html">YARN</a>
-<a class="dropdown-item" href="running-on-kubernetes.html">Kubernetes</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}spark-standalone.html">Spark Standalone</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}running-on-yarn.html">YARN</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}running-on-kubernetes.html">Kubernetes</a>
</div>
</li>

<li class="nav-item dropdown">
<a href="#" class="nav-link dropdown-toggle" id="navbarMore" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">More</a>
<div class="dropdown-menu" aria-labelledby="navbarMore">
-<a class="dropdown-item" href="configuration.html">Configuration</a>
-<a class="dropdown-item" href="monitoring.html">Monitoring</a>
-<a class="dropdown-item" href="tuning.html">Tuning Guide</a>
-<a class="dropdown-item" href="job-scheduling.html">Job Scheduling</a>
-<a class="dropdown-item" href="security.html">Security</a>
-<a class="dropdown-item" href="hardware-provisioning.html">Hardware Provisioning</a>
-<a class="dropdown-item" href="migration-guide.html">Migration Guide</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}configuration.html">Configuration</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}monitoring.html">Monitoring</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}tuning.html">Tuning Guide</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}job-scheduling.html">Job Scheduling</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}security.html">Security</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}hardware-provisioning.html">Hardware Provisioning</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}migration-guide.html">Migration Guide</a>
<div class="dropdown-divider"></div>
-<a class="dropdown-item" href="building-spark.html">Building Spark</a>
+<a class="dropdown-item" href="{{ rel_path_to_root }}building-spark.html">Building Spark</a>
<a class="dropdown-item" href="https://spark.apache.org/contributing.html">Contributing to Spark</a>
<a class="dropdown-item" href="https://spark.apache.org/third-party-projects.html">Third Party Projects</a>
</div>
@@ -137,11 +143,11 @@ <h1 style="max-width: 680px;">Apache Spark - A Unified engine for large-scale da
It provides high-level APIs in Java, Scala, Python and R,
and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools including
-<a href="sql-programming-guide.html">Spark SQL</a> for SQL and structured data processing,
-<a href="api/python/getting_started/quickstart_ps.html">pandas API on Spark</a> for pandas workloads,
-<a href="ml-guide.html">MLlib</a> for machine learning,
-<a href="graphx-programming-guide.html">GraphX</a> for graph processing,
-and <a href="structured-streaming-programming-guide.html">Structured Streaming</a>
+<a href="{{ rel_path_to_root }}sql-programming-guide.html">Spark SQL</a> for SQL and structured data processing,
+<a href="{{ rel_path_to_root }}api/python/getting_started/quickstart_ps.html">pandas API on Spark</a> for pandas workloads,
+<a href="{{ rel_path_to_root }}ml-guide.html">MLlib</a> for machine learning,
+<a href="{{ rel_path_to_root }}graphx-programming-guide.html">GraphX</a> for graph processing,
+and <a href="{{ rel_path_to_root }}streaming/index.html">Structured Streaming</a>
for incremental computation and stream processing.
</div>
</div>
@@ -150,12 +156,13 @@ <h1 style="max-width: 680px;">Apache Spark - A Unified engine for large-scale da
{% endif %}

<div class="container">
-
-{% if page.url contains "/ml" or page.url contains "/sql" or page.url contains "migration-guide.html" %}
+{% if page.url contains "/ml" or page.url contains "/sql" or page.url contains "/streaming/" or page.url contains "migration-guide.html" %}
{% if page.url contains "migration-guide.html" %}
{% include nav-left-wrapper-migration.html nav-migration=site.data.menu-migration %}
{% elsif page.url contains "/ml" %}
{% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %}
+{% elsif page.url contains "/streaming/" %}
+{% include nav-left-wrapper-streaming.html nav-streaming=site.data.menu-streaming %}
{% else %}
{% include nav-left-wrapper-sql.html nav-sql=site.data.menu-sql %}
{% endif %}
@@ -191,8 +198,8 @@ <h1 class="title">{{ page.title }}</h1>
crossorigin="anonymous"></script>
<script src="https://code.jquery.com/jquery.js"></script>

-<script src="js/vendor/anchor.min.js"></script>
-<script src="js/main.js"></script>
+<script src="/js/vendor/anchor.min.js"></script>
+<script src="/js/main.js"></script>

<script type="text/javascript" src="https://cdn.jsdelivr.net/npm/docsearch.js@2/dist/cdn/docsearch.min.js"></script>
<script type="text/javascript">
2 changes: 1 addition & 1 deletion docs/index.md
@@ -106,7 +106,7 @@ options for deployment:
* [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
* [RDD Programming Guide](rdd-programming-guide.html): overview of Spark basics - RDDs (core but old API), accumulators, and broadcast variables
* [Spark SQL, Datasets, and DataFrames](sql-programming-guide.html): processing structured data with relational queries (newer API than RDDs)
-* [Structured Streaming](structured-streaming-programming-guide.html): processing structured data streams with relation queries (using Datasets and DataFrames, newer API than DStreams)
+* [Structured Streaming](./streaming/index.html): processing structured data streams with relation queries (using Datasets and DataFrames, newer API than DStreams)
* [Spark Streaming](streaming-programming-guide.html): processing data streams using DStreams (old API)
* [MLlib](ml-guide.html): applying machine learning algorithms
* [GraphX](graphx-programming-guide.html): processing graphs
2 changes: 1 addition & 1 deletion docs/migration-guide.md
@@ -24,7 +24,7 @@ for users to migrate effectively.

* [Spark Core](core-migration-guide.html)
* [SQL, Datasets, and DataFrame](sql-migration-guide.html)
-* [Structured Streaming](ss-migration-guide.html)
+* [Structured Streaming](./streaming/ss-migration-guide.html)
* [MLlib (Machine Learning)](ml-migration-guide.html)
* [PySpark (Python on Spark)](pyspark-migration-guide.html)
* [SparkR (R on Spark)](sparkr-migration-guide.html)
2 changes: 1 addition & 1 deletion docs/sparkr.md
@@ -660,7 +660,7 @@ The following example shows how to save/load a MLlib model by SparkR.

# Structured Streaming

-SparkR supports the Structured Streaming API. Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. For more information see the R API on the [Structured Streaming Programming Guide](structured-streaming-programming-guide.html)
+SparkR supports the Structured Streaming API. Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. For more information see the R API on the [Structured Streaming Programming Guide](./streaming/index.html).

# Apache Arrow in SparkR
