Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add chunk offset to file id key to make each chunk have a unique key #825

Merged
merged 5 commits into from
Sep 18, 2024

Conversation

sfc-gh-psaha
Copy link
Contributor

@sfc-gh-psaha sfc-gh-psaha commented Sep 11, 2024

Also add sdk version to Parquet metadata.

@sfc-gh-psaha sfc-gh-psaha requested review from sfc-gh-tzhang and a team as code owners September 11, 2024 22:17
@sfc-gh-psaha sfc-gh-psaha force-pushed the psaha-SNOW-1660058-add-chunk-index branch from 74eaa4e to 4165b5a Compare September 11, 2024 22:21
@sfc-gh-psaha sfc-gh-psaha force-pushed the psaha-SNOW-1660058-add-chunk-index branch from 4165b5a to 4ce7ec7 Compare September 16, 2024 18:00
@sfc-gh-psaha sfc-gh-psaha changed the title Add chunk index to file id key to make each chunk have a unique key Add chunk offset to file id key to make each chunk have a unique key Sep 16, 2024
@sfc-gh-psaha sfc-gh-psaha force-pushed the psaha-SNOW-1660058-add-chunk-index branch from 4ce7ec7 to 6a46ded Compare September 16, 2024 18:01
Copy link
Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good in SDK, please sync with @sfc-gh-kkloudas to make sure this change won't break replication.

final String[] parts = shortName.split("\\.");
metadata.put(
Constants.PRIMARY_FILE_ID_KEY,
String.format("%s%s.%s", parts[0], chunkStartOffset, parts[1]));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any issue if the does not match with the file name? Is there any server side change this needs to wait before merging? cc: @sfc-gh-kkloudas

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And do you think we can do %s_%s.%s?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to match with what Andrey will do on the server side - we don't have to match but would be good to be consistent. Let's have @sfc-gh-azagrebin confirm what he'd do. We can wait for the server side change. And no, it does not need to match the file name at all - it can be a UUID for all change tracking cares about.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if we change the format, is there a use case that customer needs to join with previous values as part of MERGE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this value is supposed to be unique that does not appear in any other files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also vote for this:

And do you think we can do %s_%s.%s?

let me also first try the server side change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just in case to reduce any potential blast radius, I would append only non-zero offset.

Copy link
Contributor

@sfc-gh-azagrebin sfc-gh-azagrebin Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just discussed it with @sfc-gh-lpapke. We have to add "_" because the name w/o extension ends with a number in our case and we can get into a similar problem that was mentioned for regular partitions:
name_10.bdec + offset 110 = name_10110.bdec
name_101.bdec + offset 10 = name_10110.bdec
those 2 are different bdecs but they will be the same PPNs.

or append to ".bdec" that would be the natural delimiter
maybe this is actually the safest option long term

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I added an _ separator before the .bdec suffix. PTAL. I have also added a check for 0 chunk offset.

@sfc-gh-psaha sfc-gh-psaha force-pushed the psaha-SNOW-1660058-add-chunk-index branch from e978fa1 to 3f6a5c0 Compare September 16, 2024 18:15
Copy link
Contributor

@sfc-gh-azagrebin sfc-gh-azagrebin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Purujit!

@sfc-gh-psaha sfc-gh-psaha enabled auto-merge (squash) September 18, 2024 16:39
@sfc-gh-psaha sfc-gh-psaha reopened this Sep 18, 2024
@sfc-gh-psaha sfc-gh-psaha enabled auto-merge (squash) September 18, 2024 18:58
@sfc-gh-psaha sfc-gh-psaha merged commit a107b50 into master Sep 18, 2024
91 checks passed
@sfc-gh-psaha sfc-gh-psaha deleted the psaha-SNOW-1660058-add-chunk-index branch September 18, 2024 20:04
sfc-gh-psaha added a commit that referenced this pull request Oct 1, 2024
sfc-gh-psaha added a commit that referenced this pull request Oct 1, 2024
#848)

Revert "Add chunk offset to file id key to make each chunk have a unique key (#825)"

This reverts commit a107b50.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants