
Add table changes constructor #505

Merged

Conversation


@OussamaSaoudi-db OussamaSaoudi-db commented Nov 18, 2024

What changes are proposed in this pull request?

This PR introduces the `TableChanges` struct, which represents a Change Data Feed (CDF) scan. `TableChanges` is constructed from a `Table` and performs two protocol and metadata (P&M) scans:

  1. The first is a P&M scan at the start version, which ensures that CDF is enabled at the beginning of the range.
  2. The second P&M scan is at the end version. It is used to extract the schema at the end version and to ensure that the final version has CDF enabled.

I also add the logic for converting the end version's schema into the CDF schema.

Note that failing early in the `TableChanges` constructor aligns with Spark's behaviour: Spark's error reports only the CDF range, not the specific commit version that caused the failure.

How was this change tested?

  • Ensure that `TableChanges::try_new` checks the start and end versions
  • Ensure that the schemas at the start and end versions are the same
  • Ensure that the `table_changes.schema()` method returns the CDF schema

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Nov 18, 2024
@OussamaSaoudi-db OussamaSaoudi-db marked this pull request as ready for review November 19, 2024 00:11

codecov bot commented Nov 19, 2024

Codecov Report

Attention: Patch coverage is 87.05036% with 18 lines in your changes missing coverage. Please review.

Project coverage is 80.31%. Comparing base (e450c05) to head (96a4b71).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
kernel/src/table_changes/mod.rs 86.06% 12 Missing and 5 partials ⚠️
ffi/src/lib.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #505      +/-   ##
==========================================
+ Coverage   80.24%   80.31%   +0.06%     
==========================================
  Files          61       62       +1     
  Lines       13402    13541     +139     
  Branches    13402    13541     +139     
==========================================
+ Hits        10755    10876     +121     
- Misses       2093     2106      +13     
- Partials      554      559       +5     


pub fn table_changes(
&self,
engine: &dyn Engine,
start_version: Version,
Collaborator

why not impl Into<Version>?

Collaborator Author

I hadn't considered that. Though this change seems to cause issues:

154 |         let table_changes_res = table.table_changes(engine.as_ref(), 3, 4);
    |                                       -------------                  ^ the trait `From<i32>` is not implemented for `u64`, which is required by `{integer}: Into<u64>`

Integer literals are treated as `i32` by default, and `i32` can't be converted into `u64`. With just `start_version: Version`, the compiler treats the literal as a `u64` from the get-go.

Collaborator

Out of curiosity, what situation would produce an Into<Version> that is not already Version, that we need the fancy arg passing?

Collaborator

Yeah, maybe my question should actually have been, "why not Option<Version> for end_version"? When do we need the into there? I just erred on the side of flexibility

Collaborator Author

Oh, using `end_version: impl Into<Option<Version>>` lets us pass either a `Version` or an `Option<Version>` to the function. This is a trick I learned from @scovich.
So the following are legal:
table.table_changes(engine.as_ref(), 3, 4);
table.table_changes(engine.as_ref(), 3, Some(4));
table.table_changes(engine.as_ref(), 3, None);

Collaborator

Ahh cool, that's nice. Sorry for the long aside :)

.get(ENABLE_CDF_FLAG)
.is_some_and(|val| val == "true")
};
if !is_cdf_enabled(&start_snapshot) {
Collaborator

I think we need to check at every point along the way, not just start and end.

Collaborator Author

True, we could leave that till later. I was hoping to do some checking at this stage since we can catch that error earlier. Should I just remove this check?

Collaborator

No I think this is okay, as long as we somehow check at each point. I think if it's cheap to error early we should

Collaborator Author

No extra cost since we needed to get both snapshots anyway.

Collaborator

can we add comments here for clarity? (e.g. "we check start/end to fail early if needed but at each point yielding CDF batches we do schema compat check" or something like that)

Collaborator Author

Added comments addressing the enable-CDF flag, the schema, and the protocol.

/// Create a [`TableChanges`] to get a change data feed for the table between `start_version`,
/// and `end_version`. If no `end_version` is supplied, the latest version will be used as the
/// `end_version`.
pub fn table_changes(
Collaborator

Don't we need to be able to specify:

  1. A schema? Does CDF always return the full table schema?
  2. A predicate? Can't you get a CDF with a predicate?

Collaborator Author

This is usually done in the Scan Builder. The plan is to specify the schema and predicate when building a TableChangesScan.

Collaborator

And I suppose there are no optimizations to be done here with that information? If yes, we likely would want to propagate that information here.

Collaborator

since this is the main 'entrypoint' API for users to interact with reading CDF can we add some docs on semantics and ideally examples too? important things to call out: (1) require that CDF is enabled for the whole range, (2) require the same schema for the whole range (for now!), (3) how do i scan this thing?

(i'm probably forgetting other bits of semantics so don't take the above as exhaustive)

@OussamaSaoudi-db (Collaborator Author)

@nicklan @zachschuermann I wanted your opinion on the TableChanges schema. I added the code to generate the CDF schema. I think it makes sense to do it here instead of in the scan builder. This is because the user would likely project the schema from the table, and use that as the scan schema.

let schema = table_changes.schema().project(...);
let scan_builder = scan_builder.with_schema(schema);

@OussamaSaoudi-db (Collaborator Author)

Also I wanted to flag this: I have the field end_snapshot, but we are really interested in column mapping mode, partition columns. We would additionally need the schema if we decide to create the cdf schema in the scan builder (as opposed to the current implementation).

I'm leaning towards just storing column mapping mode and partition columns directly in TableChanges. Do you foresee that we'll need to store the end version's metadata, protocol, or other fields?

@scovich (Collaborator)

scovich commented Nov 19, 2024

Also I wanted to flag this: I have the field end_snapshot, but we are really interested in column mapping mode, partition columns. We would additionally need the schema if we decide to create the cdf schema in the scan builder (as opposed to the current implementation).

I'm leaning towards just storing column mapping mode and partition columns directly in TableChanges. Do you foresee that we'll need to store the end version's metadata, protocol, or other fields?

AFAIK, streaming in delta-spark does a complicated dance to process metadata changes in a commit separately from the data changes of that commit, and an incompatible metadata change causes the stream to restart. We should probably use that code as inspiration so we don't reinvent the wheel? (but it's also somewhat messy code due to its organic development over years, so we should probably not copy blindly)

@nicklan nicklan left a comment (Collaborator)

cool, mostly fine, just one small thing



pub log_segment: LogSegment,
table_root: Url,
end_snapshot: Snapshot,
start_version: Version,
Collaborator

Just checking, we won't have to re-find the start snapshot when we actually go to return results right?

Collaborator Author

No, we don't. From here on, all we're interested in are the commits in the `LogSegment` and the schema we got from the `end_snapshot`.

}
if start_snapshot.schema() != end_snapshot.schema() {
return Err(Error::generic(
"Failed to build TableChanges: Start and end version schemas are different.",
Collaborator

let's put the schemas in the output to help with debugging

Collaborator Author

Added to the print logs. How's it look?

@zachschuermann (Collaborator)

Also I wanted to flag this: I have the field end_snapshot, but we are really interested in column mapping mode, partition columns. We would additionally need the schema if we decide to create the cdf schema in the scan builder (as opposed to the current implementation).

I'm leaning towards just storing column mapping mode and partition columns directly in TableChanges. Do you foresee that we'll need to store the end version's metadata, protocol, or other fields?

let's do the minimal thing for now and only include the data we need immediately

@zachschuermann zachschuermann left a comment (Collaborator)

(copying from slack thread so we don't lose it)

You mentioned

Note that the behaviour to fail early in table changes constructor aligns with spark's behaviour. Only the CDF range is returned in spark's error. No specific commit version that causes the failure is provided.

can we add that to PR description and/or docs?

generally PR looks good, awesome work! mostly nits and will come back for final review after those are addressed



self.start_version
}
/// The end version of the `TableChanges`. If no end_version was specified in
/// [`TableChanges::try_new`], this returns the newest version as of the call to `try_new`.
Collaborator

Suggested change
/// [`TableChanges::try_new`], this returns the newest version as of the call to `try_new`.
/// [`TableChanges::try_new`], this returns the newest version as of the call to [`try_new`].

Collaborator Author

This causes issues with cargo doc, and I think it's a lot of visual clutter to repeat the full [`TableChanges::try_new`]

@scovich scovich left a comment (Collaborator)

LGTM

Comment on lines +15 to +17
StructField::new("_change_type", DataType::STRING, false),
StructField::new("_commit_version", DataType::LONG, false),
StructField::new("_commit_timestamp", DataType::TIMESTAMP, false),
Collaborator

Side note: delta-spark does some hand-wringing about table schemas that already provide columns with these names. I think the solution was to block enabling CDF on such tables, and to block creating columns with those names on tables that are already CDF-enabled? (both are writer issues, not reader, but it's prob worth tracking)

Collaborator Author

Tracked in #524. Thx!

Comment on lines +65 to +71
let start_snapshot =
Snapshot::try_new(table_root.as_url().clone(), engine, Some(start_version))?;
let end_snapshot = Snapshot::try_new(table_root.as_url().clone(), engine, end_version)?;
Collaborator

aside: Relating to the optimization we've discussed a few times -- the end snapshot should be able to use the first snapshot as a starting point for listing (and P&M), if the two versions aren't too far apart?

Collaborator Author

Yeah I would like that to be the case eventually.

Collaborator

added a note to #489

.metadata()
.configuration
.get(ENABLE_CDF_FLAG)
.is_some_and(|val| val == "true")
Collaborator

I guess this will simplify once #453 merges?

Collaborator Author

Yep!

} else if !is_cdf_enabled(&end_snapshot) {
return Err(Error::table_changes_disabled(end_snapshot.version()));
}
if start_snapshot.schema() != end_snapshot.schema() {
Collaborator

aside: Technically users can stuff whatever metadata they want into the schema fields; should we track an item to ignore unknown metadata entries when comparing schemas?

Collaborator Author

Yeah we could do that. I think schema checking needs to be changed a lot anyway to support at least adding columns and changing nullability, so there's more work to do in that department.

@OussamaSaoudi-db OussamaSaoudi-db (Collaborator Author) commented Nov 22, 2024

Added link to issue to track schema compatibility checks. #523

.schema()
.fields()
.cloned()
.chain(CDF_FIELDS.clone()),
Collaborator

The CDF fields are generated columns right? (not read directly from parquet)?

Collaborator Author

Yes, correct. I was planning on making them a special case only in CDF code. If you feel that we can legitimately treat these as generated columns, we could add a new column type `ColumnType::GeneratedColumn`.



@zachschuermann zachschuermann left a comment (Collaborator)

LGTM ship ship ship

@OussamaSaoudi-db OussamaSaoudi-db merged commit d146b80 into delta-io:main Nov 22, 2024
20 checks passed
@OussamaSaoudi-db OussamaSaoudi-db deleted the table_changes_constructor_2 branch November 22, 2024 02:37
OussamaSaoudi-db added a commit that referenced this pull request Nov 22, 2024

## What changes are proposed in this pull request?
This removes files that were accidentally added in prior PRs that were
un-reviewed in #505 and #506.



## How was this change tested?

Co-authored-by: Zach Schuermann <[email protected]>