-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SNOW-1654124: Write file name to metadata at the place when we create the file #824
Conversation
This reverts commit 412ad3d.
// We insert the filename in the file itself as metadata so that streams can work on replicated | ||
// mixed tables. For a more detailed discussion on the topic see SNOW-561447 and | ||
// http://go/streams-on-replicated-mixed-tables | ||
metadata.put(Constants.PRIMARY_FILE_ID_KEY, StreamingIngestUtils.getShortname(filePath)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the major change, i move the place where we put the file name
src/main/java/net/snowflake/ingest/streaming/internal/ParquetChunkData.java
Outdated
Show resolved
Hide resolved
src/main/java/net/snowflake/ingest/streaming/internal/ParquetRowBuffer.java
Show resolved
Hide resolved
src/test/java/net/snowflake/ingest/streaming/internal/RowBufferTest.java
Show resolved
Hide resolved
wondering if it is possible to defer filename insertion (into metadata) and do it inside InternalStage.put() ? I'm working on a very related area - uploading to presigned urls for iceberg - and was thinking through cases where the URL has expired and we want to do a retry with a new URL. With the way things are setup right now, if the presigned token has expired I'll end up redoing the buildAndUpload call in the task that FlushService creates with a new URL each time, which will be more wasteful than just reserializing the metadata chunk when a new filename needs to be used. |
This reverts commit 22773bb.
It's possible, but let's discuss this outside of this PR? We want to get this in ASAP and create a new release on top. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fix lgtm. A test to repro the original issue would be good to have.
Re serializeFromParquetWriteBuffers: Good that you bring it up, I'll remove it because we don't plan to enable it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I am missing something but I think the test is missing for this change. Otherwise LGTM.
It could be at least write/read test in SDK because there is no way to test it before release with our server side tests.
I could be something like this:
create test buffer -> create flusher from it -> flush to byte[] -> create BdecParquetReader from bytes -> get metadata: fileReader.getFileMetaData().getKeyValueMetaData()
(ParquetFileReader.fileReader
needs to become a field to get metadata from it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code change looks good, lets add a test before checking it in!
Thanks Andrey! Took me some time but I'm able to add |
pom.xml
Outdated
<dependency> | ||
<groupId>org.apache.hadoop</groupId> | ||
<artifactId>hadoop-mapreduce-client-core</artifactId> | ||
<version>${hadoop.version}</version> | ||
<scope>test</scope> | ||
</dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed for ParquetFileReader to work
serializeFromJavaObjects
)serializeFromParquetWriteBuffers
is not maintained to work withprimaryFileId
as theParquetWriter
is created duringsetupSchema
, we would better remove the code as a whole if we decide to not use it, or a bigger, separated change is required.