Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parse table metadata.configuration as TableProperties #453

Merged
merged 27 commits into from
Nov 25, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
ba8309e
checkpoint with serde but think i need to change that
zachschuermann Nov 1, 2024
9d4b599
rough draft serde for table props
zachschuermann Nov 5, 2024
1b7b193
make everything optional
zachschuermann Nov 5, 2024
02d50ee
errors, comments, cleanup
zachschuermann Nov 8, 2024
e4676d6
fix
zachschuermann Nov 8, 2024
9f8afa4
use new col name list parsing
zachschuermann Nov 8, 2024
ed2c10a
Merge remote-tracking branch 'upstream/main' into table-properties
zachschuermann Nov 13, 2024
42e6028
docs
zachschuermann Nov 18, 2024
f1b9a16
Merge remote-tracking branch 'upstream/main' into table-properties
zachschuermann Nov 18, 2024
82370b4
remove derive
zachschuermann Nov 18, 2024
00b9d8e
make deserializer work on hashmap ref
zachschuermann Nov 18, 2024
f748f87
fix column mapping mode check
zachschuermann Nov 19, 2024
af08092
testing, errors, docs, cleanup
zachschuermann Nov 19, 2024
4587794
cleanup
zachschuermann Nov 19, 2024
1e7d286
fix skipping dat test
zachschuermann Nov 20, 2024
bd9ac7a
address feedback, cleanup
zachschuermann Nov 21, 2024
fa48054
Merge branch 'main' into table-properties
zachschuermann Nov 21, 2024
ff78623
remove unused const
zachschuermann Nov 22, 2024
b667a15
no more serde
zachschuermann Nov 22, 2024
b3cdc61
cleanup
zachschuermann Nov 22, 2024
a891b52
Merge remote-tracking branch 'upstream/main' into table-properties
zachschuermann Nov 22, 2024
d8a2933
add back col mapping mode fn
zachschuermann Nov 22, 2024
d1ce73d
address ryan review
zachschuermann Nov 23, 2024
d8af98c
Merge branch 'main' into table-properties
zachschuermann Nov 25, 2024
6d1b466
use NonZero<u64>
zachschuermann Nov 25, 2024
f18b885
Merge remote-tracking branch 'refs/remotes/origin/table-properties' i…
zachschuermann Nov 25, 2024
437b8db
clippy
zachschuermann Nov 25, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 7 additions & 6 deletions kernel/src/table_properties.rs
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
//! table.

use std::collections::HashMap;
use std::num::NonZero;
use std::time::Duration;

use crate::expressions::ColumnName;
Expand Down Expand Up @@ -43,7 +44,7 @@ pub struct TableProperties {

/// Interval (expressed as number of commits) after which a new checkpoint should be created.
/// E.g. if checkpoint interval = 10, then a checkpoint should be written every 10 commits.
pub checkpoint_interval: Option<i64>,
pub checkpoint_interval: Option<NonZero<u64>>,

/// true for Delta Lake to write file statistics in checkpoints in JSON format for the stats column.
pub checkpoint_write_stats_as_json: Option<bool>,
Expand Down Expand Up @@ -114,7 +115,7 @@ pub struct TableProperties {

/// When delta.randomizeFilePrefixes is set to true, the number of characters that Delta
/// generates for random prefixes.
pub random_prefix_length: Option<i64>,
pub random_prefix_length: Option<NonZero<u64>>,

/// The shortest duration within which new snapshots will retain transaction identifiers (for
/// example, SetTransactions). When a new snapshot sees a transaction identifier older than or
Expand All @@ -124,7 +125,7 @@ pub struct TableProperties {

/// The target file size in bytes or higher units for file tuning. For example, 104857600
/// (bytes) or 100mb.
pub target_file_size: Option<i64>,
pub target_file_size: Option<NonZero<u64>>,

/// The target file size in bytes or higher units for file tuning. For example, 104857600
/// (bytes) or 100mb.
Expand Down Expand Up @@ -273,7 +274,7 @@ mod tests {
append_only: Some(true),
optimize_write: Some(true),
auto_compact: Some(true),
checkpoint_interval: Some(101),
checkpoint_interval: Some(NonZero::new(101).unwrap()),
checkpoint_write_stats_as_json: Some(true),
checkpoint_write_stats_as_struct: Some(true),
column_mapping_mode: Some(ColumnMappingMode::Id),
Expand All @@ -286,9 +287,9 @@ mod tests {
log_retention_duration: Some(Duration::new(2, 0)),
enable_expired_log_cleanup: Some(true),
randomize_file_prefixes: Some(true),
random_prefix_length: Some(1001),
random_prefix_length: Some(NonZero::new(1001).unwrap()),
set_transaction_retention_duration: Some(Duration::new(60, 0)),
target_file_size: Some(1_000_000_000),
target_file_size: Some(NonZero::new(1_000_000_000).unwrap()),
tune_file_sizes_for_rewrites: Some(true),
checkpoint_policy: Some(CheckpointPolicy::V2),
enable_row_tracking: Some(true),
Expand Down
38 changes: 35 additions & 3 deletions kernel/src/table_properties/deserialize.rs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these custom deserializers have a fallback for cases where the table property exists but cannot parse to the expected type? Ideally, those should land in the unknown_properties map along with everything else, with the parsed version being None.

Needed because many table properties are not necessarily in the Delta spec, and so different clients may use them differently?

Copy link
Collaborator Author

@zachschuermann zachschuermann Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I currently implemented that as failure: see fail_known_keys test. You're saying the behavior should instead be something like this?

    #[test]
    fn known_key_unknown_val() {
        let properties = HashMap::from([("delta.appendOnly".to_string(), "wack".to_string())]);
        let de = MapDeserializer::<_, de::value::Error>::new(properties.clone().into_iter());
        let table_properties = TableProperties::deserialize(de).unwrap();
        let unknown_properties = HashMap::from([("delta.appendOnly", "wack")]);
        let expected = TableProperties {
            unknown_properties,
            ..Default::default()
        };
        assert_eq!(table_properties, expected);
    }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. Otherwise I fear we'll eventually add yet another parsing failure ticket to the pile, along with all the CommitInfo issues.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yea fair point.. let me see how to do this :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

welp this was the straw that broke serde's back.. instead of implementing a custom Deserialize I just decided to do the naive thing and implement a big match by hand. Actually isn't that gross and we can always try serde again or just write our own macro to make everything nice

Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,16 @@
//! us to relatively simply implement the functionality described in the protocol and expose
//! 'simple' types to the user in the [`TableProperties`] struct. E.g. we can expose a `bool`
//! directly instead of a `BoolConfig` type that we implement `Deserialize` for.
use std::num::NonZero;
use std::time::Duration;

use super::*;
use crate::expressions::ColumnName;
use crate::table_features::ColumnMappingMode;
use crate::utils::require;

use tracing::warn;

const SECONDS_PER_MINUTE: u64 = 60;
const SECONDS_PER_HOUR: u64 = 60 * SECONDS_PER_MINUTE;
const SECONDS_PER_DAY: u64 = 24 * SECONDS_PER_HOUR;
Expand Down Expand Up @@ -80,10 +83,10 @@ fn try_parse(props: &mut TableProperties, k: &str, v: &str) -> Option<()> {

/// Deserialize a string representing a positive integer into an `Option<u64>`. Returns `Some` if
/// successfully parses, and `None` otherwise.
pub(crate) fn parse_positive_int(s: &str) -> Option<i64> {
pub(crate) fn parse_positive_int(s: &str) -> Option<NonZero<u64>> {
// parse to i64 (then check n > 0) since java doesn't even allow u64
let n: i64 = s.parse().ok()?;
(n > 0).then_some(n)
NonZero::new(n.try_into().ok()?)
}

/// Deserialize a string representing a boolean into an `Option<bool>`. Returns `Some` if
Expand All @@ -99,7 +102,9 @@ pub(crate) fn parse_bool(s: &str) -> Option<bool> {
/// Deserialize a comma-separated list of column names into an `Option<Vec<ColumnName>>`. Returns
/// `Some` if successfully parses, and `None` otherwise.
pub(crate) fn parse_column_names(s: &str) -> Option<Vec<ColumnName>> {
ColumnName::parse_column_name_list(s).ok()
ColumnName::parse_column_name_list(s)
.map_err(|e| warn!("column name list failed to parse: {e}"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a job for inspect_err?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yep good catch thanks - I'll just squeeze that into the next small PR

.ok()
}

/// Deserialize an interval string of the form "interval 5 days" into an `Option<Duration>`.
Expand Down Expand Up @@ -185,6 +190,33 @@ fn parse_interval_impl(value: &str) -> Result<Duration, ParseIntervalError> {
#[cfg(test)]
mod tests {
use super::*;
use crate::expressions::column_name;

#[test]
fn test_parse_column_names() {
assert_eq!(
parse_column_names("`col 1`, col.a2,col3").unwrap(),
vec![
ColumnName::new(["col 1"]),
column_name!("col.a2"),
column_name!("col3")
]
);
}
Comment on lines +196 to +205
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_parse_column_name_list in column_names.rs already covers this pretty thoroughly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm yea maybe that was unnecessary - i'll remove since this is basically jsut a pass-through to the column name parsing


#[test]
fn test_parse_bool() {
assert_eq!(parse_bool("true").unwrap(), true);
assert_eq!(parse_bool("false").unwrap(), false);
assert_eq!(parse_bool("whatever"), None);
}

#[test]
fn test_parse_positive_int() {
assert_eq!(parse_positive_int("123").unwrap().get(), 123);
assert_eq!(parse_positive_int("0"), None);
assert_eq!(parse_positive_int("-123"), None);
}

#[test]
fn test_parse_interval() {
Expand Down
Loading