
SNOW-1654124: Write file name to metadata at the place when we create the file #824

Merged (10 commits) on Sep 10, 2024

Conversation

@sfc-gh-tzhang (Contributor) commented on Sep 7, 2024

  • Write the file name to the metadata at the point where we create the file (only works for serializeFromJavaObjects).
  • It looks like serializeFromParquetWriteBuffers is not maintained to work with primaryFileId, since the ParquetWriter is created during setupSchema. We should either remove that code entirely if we decide not to use it, or a bigger, separate change is required.

@sfc-gh-tzhang sfc-gh-tzhang requested review from a team as code owners September 7, 2024 05:22
Comment on lines +215 to +218
// We insert the filename in the file itself as metadata so that streams can work on replicated
// mixed tables. For a more detailed discussion on the topic see SNOW-561447 and
// http://go/streams-on-replicated-mixed-tables
metadata.put(Constants.PRIMARY_FILE_ID_KEY, StreamingIngestUtils.getShortname(filePath));
sfc-gh-tzhang (Contributor, Author)

This is the major change: I moved the place where we put the file name.
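The idea in the diff above is that the file name is stamped into the file's key/value metadata at the moment the file is created, so the embedded name always matches the file actually written. A minimal stdlib sketch of that pattern follows; the constant value and the body of getShortname are assumptions for illustration (the real ones live in Constants.PRIMARY_FILE_ID_KEY and StreamingIngestUtils.getShortname):

```java
import java.util.HashMap;
import java.util.Map;

public class FileMetadataSketch {
    // Assumed value; the real key is Constants.PRIMARY_FILE_ID_KEY.
    static final String PRIMARY_FILE_ID_KEY = "primaryFileId";

    // Approximation of StreamingIngestUtils.getShortname: strip any directory prefix.
    static String getShortname(String filePath) {
        int idx = filePath.lastIndexOf('/');
        return idx < 0 ? filePath : filePath.substring(idx + 1);
    }

    // Stamp the short file name into the key/value metadata at file-creation time.
    static Map<String, String> buildMetadata(String filePath) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(PRIMARY_FILE_ID_KEY, getShortname(filePath));
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> md = buildMetadata("stage/prefix/abc_123.bdec");
        System.out.println(md.get(PRIMARY_FILE_ID_KEY)); // prints "abc_123.bdec"
    }
}
```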

@sfc-gh-hmadan (Collaborator)

Wondering if it is possible to defer the filename insertion (into metadata) and do it inside InternalStage.put()?

I'm working on a very related area - uploading to presigned URLs for Iceberg - and was thinking through cases where the URL has expired and we want to retry with a new URL. With the way things are set up right now, if the presigned token has expired I'll end up redoing the buildAndUpload call in the task that FlushService creates with a new URL each time, which is more wasteful than just reserializing the metadata chunk when a new filename needs to be used.

@sfc-gh-tzhang (Contributor, Author)

> wondering if it is possible to defer filename insertion (into metadata) and do it inside InternalStage.put()? […]

It's possible, but let's discuss this outside of this PR. We want to get this in ASAP and create a new release on top.
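The deferred-stamping idea discussed above can be sketched in a few lines: on a presigned-URL expiry, fetch a fresh URL and re-stamp only the metadata, rather than redoing the whole buildAndUpload. Everything here (Uploader, uploadWithRetry, the key value) is a hypothetical illustration, not the SDK's actual API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class PresignedUrlRetrySketch {
    static final String PRIMARY_FILE_ID_KEY = "primaryFileId"; // assumed value

    // Hypothetical upload hook: returns false when the presigned URL has expired.
    interface Uploader {
        boolean tryUpload(String url, byte[] chunk, Map<String, String> metadata);
    }

    // On each attempt, fetch a fresh URL and re-stamp the metadata; the chunk
    // itself is not rebuilt, which is the cheaper retry the comment argues for.
    static boolean uploadWithRetry(Uploader uploader, Supplier<String> freshUrl,
                                   byte[] chunk, String fileName, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Map<String, String> metadata = new HashMap<>();
            metadata.put(PRIMARY_FILE_ID_KEY, fileName); // cheap per-attempt re-stamp
            if (uploader.tryUpload(freshUrl.get(), chunk, metadata)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulated uploader that fails once (expired URL) and then succeeds.
        int[] calls = {0};
        Uploader flaky = (url, chunk, md) -> ++calls[0] > 1;
        boolean ok = uploadWithRetry(flaky, () -> "https://example/presigned",
                new byte[0], "f.bdec", 3);
        System.out.println(ok + " after " + calls[0] + " attempts"); // prints "true after 2 attempts"
    }
}
```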

@sfc-gh-gdoci (Collaborator) left a comment

Thanks, the fix LGTM. A test to reproduce the original issue would be good to have.

Re serializeFromParquetWriteBuffers: Good that you bring it up, I'll remove it because we don't plan to enable it.

@sfc-gh-azagrebin (Contributor) left a comment

Maybe I am missing something, but I think a test is missing for this change. Otherwise LGTM.

It should at least be a write/read test in the SDK, because there is no way to test it before release with our server-side tests.
It could be something like this:
create test buffer -> create flusher from it -> flush to byte[] -> create BdecParquetReader from the bytes -> get the metadata via fileReader.getFileMetaData().getKeyValueMetaData() (ParquetFileReader.fileReader needs to become a field so the metadata can be read from it)

@sfc-gh-hmadan (Collaborator) left a comment

Code change looks good, let's add a test before checking it in!

@sfc-gh-tzhang (Contributor, Author)

> Maybe I am missing something but I think the test is missing for this change. […]

Thanks Andrey! It took me some time, but I was able to add testParquetFileNameMetadata based on your suggestion. PTAL.
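The real test follows the flow suggested above (flush to byte[], read back with BdecParquetReader, assert on getKeyValueMetaData()), which needs the project's internal classes and parquet-hadoop. As a dependency-free illustration of the same write -> bytes -> read -> assert round trip, here is a sketch that uses java.util.Properties as a stand-in for the Parquet footer's key/value metadata; the key value is an assumption:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

public class MetadataRoundTripSketch {
    static final String PRIMARY_FILE_ID_KEY = "primaryFileId"; // assumed value

    // "Flush": serialize the key/value metadata to bytes (stand-in for the footer).
    static byte[] flush(String fileName) {
        try {
            Properties metadata = new Properties();
            metadata.setProperty(PRIMARY_FILE_ID_KEY, fileName);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            metadata.store(out, null);
            return out.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // in-memory streams should never throw
        }
    }

    // "Read back": parse the bytes and pull the key/value metadata out again.
    static String readPrimaryFileId(byte[] bytes) {
        try {
            Properties metadata = new Properties();
            metadata.load(new ByteArrayInputStream(bytes));
            return metadata.getProperty(PRIMARY_FILE_ID_KEY);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = flush("abc_123.bdec");
        System.out.println(readPrimaryFileId(bytes)); // prints "abc_123.bdec"
    }
}
```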

pom.xml Outdated
Comment on lines 355 to 360
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>${hadoop.version}</version>
<scope>test</scope>
</dependency>
sfc-gh-tzhang (Contributor, Author)

This dependency is needed for ParquetFileReader to work.

@sfc-gh-tzhang sfc-gh-tzhang merged commit 3734061 into master Sep 10, 2024
47 checks passed
@sfc-gh-tzhang sfc-gh-tzhang deleted the tzhang-si-corruption branch September 10, 2024 18:44