-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add chunk offset to file id key to make each chunk have a unique key #825
Conversation
74eaa4e
to
4165b5a
Compare
src/main/java/net/snowflake/ingest/streaming/internal/ParquetFlusher.java
Outdated
Show resolved
Hide resolved
4165b5a
to
4ce7ec7
Compare
… Also add sdk version to Parquet metadata.
4ce7ec7
to
6a46ded
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good in SDK, please sync with @sfc-gh-kkloudas to make sure this change won't break replication.
final String[] parts = shortName.split("\\."); | ||
metadata.put( | ||
Constants.PRIMARY_FILE_ID_KEY, | ||
String.format("%s%s.%s", parts[0], chunkStartOffset, parts[1])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any issue if the does not match with the file name? Is there any server side change this needs to wait before merging? cc: @sfc-gh-kkloudas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And do you think we can do %s_%s.%s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to match with what Andrey will do on the server side - we don't have to match but would be good to be consistent. Let's have @sfc-gh-azagrebin confirm what he'd do. We can wait for the server side change. And no, it does not need to match the file name at all - it can be a UUID for all change tracking cares about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if we change the format, is there a use case that customer needs to join with previous values as part of MERGE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this value is supposed to be unique that does not appear in any other files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also vote for this:
And do you think we can do %s_%s.%s?
let me also first try the server side change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just in case to reduce any potential blast radius, I would append only non-zero offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just discussed it with @sfc-gh-lpapke. We have to add "_" because the name w/o extension ends with a number in our case and we can get into a similar problem that was mentioned for regular partitions:
name_10.bdec + offset 110 = name_10110.bdec
name_101.bdec + offset 10 = name_10110.bdec
those 2 are different bdecs but they will be the same PPNs.
or append to ".bdec" that would be the natural delimiter
maybe this is actually the safest option long term
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I added an _
separator before the .bdec
suffix. PTAL. I have also added a check for 0 chunk offset.
src/main/java/net/snowflake/ingest/streaming/internal/ParquetRowBuffer.java
Show resolved
Hide resolved
e978fa1
to
3f6a5c0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Purujit!
Also add sdk version to Parquet metadata.