
[Docs] Update spark-getting-started docs page to make the example valid #11923

Open

nickdelnano wants to merge 5 commits into main

Conversation


@nickdelnano commented Jan 7, 2025

The Spark Getting Started docs page has intro Spark examples, but they reference tables and columns that do not exist. This is one of the first docs pages that new Iceberg users will see, so having a correct example that someone can run is helpful to them.

I found this as I was reading the project's tests and saw this TODO marked in SmokeTest.java:

```java
// Run through our Doc's Getting Started Example
// TODO Update doc example so that it can actually be run, modifications were required for this
// test suite to run
```

I've updated the docs in line with the test cases, and also made a minor change to the example to make it clearer: each MERGE sets a unique data value for each id.

Testing

  • Ran the tests modified in this PR - they pass
  • Built the site locally as documented and loaded the modified spark-getting-started page

Comment on lines -86 to -87
```diff
-MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id
-WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count
```
Author:

Before this PR, `updates` does not exist, nor do `t.count` or `u.count`.

```diff
@@ -160,7 +163,7 @@ This type conversion table describes how Spark types are converted to the Iceber
 | map | map | |

 !!! info
-    The table is based on representing conversion during creating table. In fact, broader supports are applied on write. Here're some points on write:
+    The table is based on type conversions during table creation. Broader type conversions are applied on write:
```
Author:

small grammar improvements

Contributor:

nit: the paragraph before mentions the table is for both create and write, while this sentence says it's only based on create.

Author:

thanks, updated

```diff
@@ -77,21 +77,24 @@ Once your table is created, insert data using [`INSERT INTO`](spark-writes.md#in
 INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');
-INSERT INTO local.db.table SELECT id, data FROM source WHERE length(data) = 1;
```
Author:

This statement does not add much to the simple example here; remove it.

```diff
@@ -66,25 +64,25 @@ public void testGettingStarted() throws IOException {
 sql(
     "CREATE TABLE updates (id bigint, data string) USING parquet LOCATION '%s'",
     temp.newFolder());
-sql("INSERT INTO updates VALUES (1, 'x'), (2, 'x'), (4, 'z')");
+sql("INSERT INTO updates VALUES (1, 'x'), (2, 'y'), (4, 'z')");
```
Author:

To make the example more interesting to users, set unique values of `data` so that the effect of MERGE is clearer in the result.

Contributor:

I like the original example since it hits all branches of the MERGE INTO statement.
Also, it'd be nice to keep track of table state in the comments.

Author:

The example still hits all branches of MERGE:

  • id 1 and 2 are updated
  • id 3, 10, 11 are unchanged
  • id 4 does not match and is inserted

The change here is to provide a unique data value for the results, as that helps explain the example in the docs better.
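The branch behavior described in the bullets above can be sketched in plain Python. This is not Spark, just a hedged simulation of MERGE INTO semantics; the pre-existing `data` values for the target rows (`'a'`, `'b'`, etc.) are placeholders, not values from the PR:

```python
# Plain-Python simulation of Spark's MERGE INTO branch semantics.
# Target/update ids follow the bullets above; the pre-existing data
# values are hypothetical placeholders.
target = {1: "a", 2: "b", 3: "c", 10: "j", 11: "k"}  # id -> data
updates = {1: "x", 2: "y", 4: "z"}

for uid, udata in updates.items():
    if uid in target:
        # WHEN MATCHED THEN UPDATE SET t.data = u.data
        target[uid] = udata
    else:
        # WHEN NOT MATCHED THEN INSERT *
        target[uid] = udata

# ids 1 and 2 are updated, id 4 is inserted, ids 3, 10, 11 are unchanged
print(sorted(target.items()))
```

Because each update row carries a distinct `data` value, the updated, inserted, and unchanged rows are all distinguishable in the final state, which is the point of the change.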

@nickdelnano nickdelnano marked this pull request as ready for review January 7, 2025 22:44
@nickdelnano

Hi @kevinjqliu - I saw that you're a committer and recently looked at this doc page in #11845. Could you review this PR?

@kevinjqliu (Contributor) left a comment:

Thanks for improving the getting started guide! I've added a few comments



```diff
-MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id
-WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count
+CREATE TABLE local.db.updates (id bigint, data string) USING iceberg;
+INSERT INTO local.db.updates VALUES (1, 'x'), (2, 'y'), (4, 'z');
```
Contributor:

Same as below, let's update the values so it hits all branches of the MERGE INTO statement.

nit: also add the values as comments to track the table state

Author:

about merge branches, commented here https://github.com/apache/iceberg/pull/11923/files?diff=unified&w=0#r1906122418

The table states are straightforward until after the MERGE query (one insert per table). I have added the table state as a comment after MERGE only; otherwise there is a lot of duplication. Let me know your thoughts.

Contributor:

looks good, thanks!

@nickdelnano (Author) left a comment:

@kevinjqliu thanks for the review. I replied to your comments and added an updated screenshot in the description.




@kevinjqliu (Contributor) left a comment:

LGTM, I've tested the getting started examples locally. The SmokeTest changes also align with the getting started page.

Thank you for improving the documentation!


@kevinjqliu

There's an error in CI; looks like you need to run the linter:

```text
Execution failed for task ':iceberg-spark:iceberg-spark-runtime-3.3_2.12:spotlessJavaCheck'.
```

@nickdelnano

@kevinjqliu oops, sorry about that. Ran the linter in 98c99f1:

```shell
./gradlew spotlessApply

BUILD SUCCESSFUL in 1s
69 actionable tasks: 2 executed, 67 up-to-date
```

@kevinjqliu

Great! Thank you. I'll post this in Iceberg Slack's #documentation channel to get more eyes on it.

@kevinjqliu

Looks like we'd have to rerun the spotless check again:

```text
Run './gradlew :iceberg-spark:iceberg-spark-runtime-3.3_2.12:spotlessApply' to fix these violations.
```

@RussellSpitzer (Member) left a comment:

LGTM. As a future improvement, can we find some way to extract the code from the docs rather than recreating it in the test suite? I don't like that they can drift. It may have to be some Gradle thing where we parse out all the code blocks and then compile them into a class that is run.
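One low-tech version of that idea would be a script that pulls the fenced SQL blocks out of the docs page and feeds each statement to the test suite. A hedged sketch, assuming the page content shown here is a hypothetical sample and `extract_sql_blocks` is an illustrative helper, not anything in the Iceberg repo:

```python
import re

FENCE = "```"  # markdown fence marker, built up to avoid nesting issues here

# Stand-in for docs page content (hypothetical sample, not the real file).
PAGE = (
    "Once your table is created, insert data:\n\n"
    + FENCE + "sql\n"
    + "INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');\n"
    + FENCE + "\n\n"
    + "Merge in updates:\n\n"
    + FENCE + "sql\n"
    + "MERGE INTO local.db.table t USING (SELECT * FROM local.db.updates) u ON t.id = u.id\n"
    + "WHEN MATCHED THEN UPDATE SET t.data = u.data\n"
    + "WHEN NOT MATCHED THEN INSERT *;\n"
    + FENCE + "\n"
)

def extract_sql_blocks(markdown: str) -> list[str]:
    """Return the body of every sql-tagged fenced block, in page order."""
    pattern = FENCE + r"sql\n(.*?)" + FENCE
    return [block.strip() for block in re.findall(pattern, markdown, flags=re.DOTALL)]

# Each extracted statement could then be run through the test suite's sql() helper,
# so the docs and SmokeTest can no longer drift apart.
for statement in extract_sql_blocks(PAGE):
    print(statement.splitlines()[0])
```

A Gradle task could run such a script at build time and fail if any extracted statement does not execute, which is one way to realize the "parse out all the code blocks" suggestion.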

@nickdelnano

sorry for the delay here - I was on vacation for a bit

a9cbadd should fix the tests

```shell
./gradlew spotlessApply -DallModules
```
