
[Docs] Update spark-getting-started docs page to make the example valid #11923

Open

nickdelnano wants to merge 5 commits into main

Conversation


@nickdelnano commented Jan 7, 2025

The Spark Getting Started docs page has intro Spark examples, but they reference tables and columns that do not exist. This is one of the first docs pages that new Iceberg users will see, so having a correct example that someone can run is helpful to them.

I found this as I was reading the project's tests and saw this TODO marked in SmokeTest.java:

```java
// Run through our Doc's Getting Started Example
// TODO Update doc example so that it can actually be run, modifications were required for this
// test suite to run
```

I've updated the docs in line with the test cases, and also made a minor change to the example to make it clearer: each MERGE sets a unique data value for each id.

Testing

  • Ran the tests modified in this PR - they pass
  • Built the site locally as documented and loaded the modified spark-getting-started page

Comment on lines -86 to -87
```diff
-MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id
-WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count
```
Author:

Before this PR, `updates` does not exist, nor do `t.count` or `u.count`.

```diff
@@ -160,7 +163,7 @@ This type conversion table describes how Spark types are converted to the Iceber
 | map | map | |

 !!! info
-    The table is based on representing conversion during creating table. In fact, broader supports are applied on write. Here're some points on write:
+    The table is based on type conversions during table creation. Broader type conversions are applied on write:
```
Author:

small grammar improvements

Contributor:

nit: the paragraph before mentions the table is for both create and write, while this sentence says it's only based on create.

Author:

thanks, updated

```diff
@@ -77,21 +77,24 @@ Once your table is created, insert data using [`INSERT INTO`](spark-writes.md#in
 INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');
-INSERT INTO local.db.table SELECT id, data FROM source WHERE length(data) = 1;
```
Author:

This statement does not add much to the simple example here; remove it.

```diff
@@ -66,25 +64,25 @@ public void testGettingStarted() throws IOException {
 sql(
     "CREATE TABLE updates (id bigint, data string) USING parquet LOCATION '%s'",
     temp.newFolder());
-sql("INSERT INTO updates VALUES (1, 'x'), (2, 'x'), (4, 'z')");
+sql("INSERT INTO updates VALUES (1, 'x'), (2, 'y'), (4, 'z')");
```
Author:

To make the example more interesting to users, set unique values of `data` so that the effect of MERGE is clearer in the result.

Contributor:

I like the original example since it hits all branches of the MERGE INTO statement.
Also, it'd be nice to keep track of table state in the comments.

Author:

The example still hits all branches of MERGE:

  • id 1 and 2 are updated
  • id 3, 10, 11 are unchanged
  • id 4 does not match and is inserted

The change here is to provide a unique data value for the results, as that helps explain the example in the docs better.
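The branch behavior described in the bullets above can be sketched in plain Python. This is not Spark, just a hedged simulation of MERGE INTO semantics; the pre-existing `data` values for the target rows (`'a'`, `'b'`, etc.) are placeholders, not values from the PR:

```python
# Plain-Python simulation of Spark's MERGE INTO branch semantics.
# Target/update ids follow the bullets above; the pre-existing data
# values are hypothetical placeholders.
target = {1: "a", 2: "b", 3: "c", 10: "j", 11: "k"}  # id -> data
updates = {1: "x", 2: "y", 4: "z"}

for uid, udata in updates.items():
    if uid in target:
        # WHEN MATCHED THEN UPDATE SET t.data = u.data
        target[uid] = udata
    else:
        # WHEN NOT MATCHED THEN INSERT *
        target[uid] = udata

# ids 1 and 2 are updated, id 4 is inserted, ids 3, 10, 11 are unchanged
print(sorted(target.items()))
```

Because each update row carries a distinct `data` value, the updated, inserted, and unchanged rows are all distinguishable in the final state, which is the point of the change.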

@nickdelnano nickdelnano marked this pull request as ready for review January 7, 2025 22:44
@nickdelnano

Hi @kevinjqliu - I saw that you're a committer and recently looked at this doc page in #11845. Could you review this PR?

@kevinjqliu (Contributor) left a comment:

Thanks for improving the getting started guide! I've added a few comments



```diff
-MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id
-WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count
+CREATE TABLE local.db.updates (id bigint, data string) USING iceberg;
+INSERT INTO local.db.updates VALUES (1, 'x'), (2, 'y'), (4, 'z');
```
Contributor:

Same as below, let's update the values so it hits all branches of the MERGE INTO statement.

nit: also add the values as comments to track the table state

Author:

about merge branches, commented here https://github.com/apache/iceberg/pull/11923/files?diff=unified&w=0#r1906122418

The table states are straightforward until after the MERGE query (one insert per table). I have added the table state as a comment after MERGE only; otherwise there is a lot of duplication. Let me know your thoughts.

Contributor:

looks good, thanks!

@nickdelnano (Author) left a comment:

@kevinjqliu thanks for the review. I replied to your comments and added an updated screenshot in the description.




@kevinjqliu (Contributor) left a comment:

LGTM, I've tested the getting started examples locally. The SmokeTest changes also align with the getting started page.

Thank you for improving the documentation!


@kevinjqliu

There's an error in CI; looks like you need to run the linter:

```text
Execution failed for task ':iceberg-spark:iceberg-spark-runtime-3.3_2.12:spotlessJavaCheck'.
```

@nickdelnano

@kevinjqliu oops, sorry about that. Ran the linter in 98c99f1:

```shell
./gradlew spotlessApply

BUILD SUCCESSFUL in 1s
69 actionable tasks: 2 executed, 67 up-to-date
```

@kevinjqliu

Great! Thank you. I'll post this in Iceberg Slack's #documentation channel to get more eyes on it.

@kevinjqliu

Looks like we'd have to rerun the spotless check again:

```text
Run './gradlew :iceberg-spark:iceberg-spark-runtime-3.3_2.12:spotlessApply' to fix these violations.
```

@RussellSpitzer (Member) left a comment:

LGTM. As a future improvement, can we find some way to extract the code from the docs rather than recreating it in the test suite? I don't like that they can drift. It may have to be some Gradle thing where we parse out all the code blocks and then compile them into a class that is run.
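One low-tech version of that idea would be a script that pulls the fenced SQL blocks out of the docs page and feeds each statement to the test suite. A hedged sketch, assuming the page content shown here is a hypothetical sample and `extract_sql_blocks` is an illustrative helper, not anything in the Iceberg repo:

```python
import re

FENCE = "```"  # markdown fence marker, built up to avoid nesting issues here

# Stand-in for docs page content (hypothetical sample, not the real file).
PAGE = (
    "Once your table is created, insert data:\n\n"
    + FENCE + "sql\n"
    + "INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');\n"
    + FENCE + "\n\n"
    + "Merge in updates:\n\n"
    + FENCE + "sql\n"
    + "MERGE INTO local.db.table t USING (SELECT * FROM local.db.updates) u ON t.id = u.id\n"
    + "WHEN MATCHED THEN UPDATE SET t.data = u.data\n"
    + "WHEN NOT MATCHED THEN INSERT *;\n"
    + FENCE + "\n"
)

def extract_sql_blocks(markdown: str) -> list[str]:
    """Return the body of every sql-tagged fenced block, in page order."""
    pattern = FENCE + r"sql\n(.*?)" + FENCE
    return [block.strip() for block in re.findall(pattern, markdown, flags=re.DOTALL)]

# Each extracted statement could then be run through the test suite's sql() helper,
# so the docs and SmokeTest can no longer drift apart.
for statement in extract_sql_blocks(PAGE):
    print(statement.splitlines()[0])
```

A Gradle task could run such a script at build time and fail if any extracted statement does not execute, which is one way to realize the "parse out all the code blocks" suggestion.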

@nickdelnano

sorry for the delay here - I was on vacation for a bit

a9cbadd should fix the tests

```shell
./gradlew spotlessApply -DallModules
```
