
SNOW-1654124: Write file name to metadata at the place when we create the file #824

Merged (10 commits) on Sep 10, 2024

Conversation

@sfc-gh-tzhang (Contributor) commented on Sep 7, 2024

  • Write the file name to the metadata at the point where we create the file (only works for serializeFromJavaObjects).
  • It looks like serializeFromParquetWriteBuffers is not maintained to work with primaryFileId, since the ParquetWriter is created during setupSchema. We should either remove that code entirely if we decide not to use it, or a bigger, separate change is required.

@sfc-gh-tzhang sfc-gh-tzhang requested review from a team as code owners September 7, 2024 05:22
Comment on lines +215 to +218
// We insert the filename in the file itself as metadata so that streams can work on replicated
// mixed tables. For a more detailed discussion on the topic see SNOW-561447 and
// http://go/streams-on-replicated-mixed-tables
metadata.put(Constants.PRIMARY_FILE_ID_KEY, StreamingIngestUtils.getShortname(filePath));
sfc-gh-tzhang (Contributor, Author)

This is the major change: I moved the place where we put the file name.
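The idea in the diff above is that the file name is stamped into the file's key/value metadata at the moment the file is created, so the embedded name always matches the file actually written. A minimal stdlib sketch of that pattern follows; the constant value and the body of getShortname are assumptions for illustration (the real ones live in Constants.PRIMARY_FILE_ID_KEY and StreamingIngestUtils.getShortname):

```java
import java.util.HashMap;
import java.util.Map;

public class FileMetadataSketch {
    // Assumed value; the real key is Constants.PRIMARY_FILE_ID_KEY.
    static final String PRIMARY_FILE_ID_KEY = "primaryFileId";

    // Approximation of StreamingIngestUtils.getShortname: strip any directory prefix.
    static String getShortname(String filePath) {
        int idx = filePath.lastIndexOf('/');
        return idx < 0 ? filePath : filePath.substring(idx + 1);
    }

    // Stamp the short file name into the key/value metadata at file-creation time.
    static Map<String, String> buildMetadata(String filePath) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(PRIMARY_FILE_ID_KEY, getShortname(filePath));
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> md = buildMetadata("stage/prefix/abc_123.bdec");
        System.out.println(md.get(PRIMARY_FILE_ID_KEY)); // prints "abc_123.bdec"
    }
}
```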

@sfc-gh-hmadan (Collaborator)

Wondering if it is possible to defer the filename insertion (into metadata) and do it inside InternalStage.put()?

I'm working on a very related area - uploading to presigned URLs for Iceberg - and was thinking through cases where the URL has expired and we want to retry with a new URL. With the way things are set up right now, if the presigned token has expired I'll end up redoing the buildAndUpload call in the task that FlushService creates with a new URL each time, which is more wasteful than just reserializing the metadata chunk when a new filename needs to be used.

@sfc-gh-tzhang (Contributor, Author)

> wondering if it is possible to defer filename insertion (into metadata) and do it inside InternalStage.put()? […]

It's possible, but let's discuss this outside of this PR. We want to get this in ASAP and create a new release on top.
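The deferred-stamping idea discussed above can be sketched in a few lines: on a presigned-URL expiry, fetch a fresh URL and re-stamp only the metadata, rather than redoing the whole buildAndUpload. Everything here (Uploader, uploadWithRetry, the key value) is a hypothetical illustration, not the SDK's actual API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class PresignedUrlRetrySketch {
    static final String PRIMARY_FILE_ID_KEY = "primaryFileId"; // assumed value

    // Hypothetical upload hook: returns false when the presigned URL has expired.
    interface Uploader {
        boolean tryUpload(String url, byte[] chunk, Map<String, String> metadata);
    }

    // On each attempt, fetch a fresh URL and re-stamp the metadata; the chunk
    // itself is not rebuilt, which is the cheaper retry the comment argues for.
    static boolean uploadWithRetry(Uploader uploader, Supplier<String> freshUrl,
                                   byte[] chunk, String fileName, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Map<String, String> metadata = new HashMap<>();
            metadata.put(PRIMARY_FILE_ID_KEY, fileName); // cheap per-attempt re-stamp
            if (uploader.tryUpload(freshUrl.get(), chunk, metadata)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulated uploader that fails once (expired URL) and then succeeds.
        int[] calls = {0};
        Uploader flaky = (url, chunk, md) -> ++calls[0] > 1;
        boolean ok = uploadWithRetry(flaky, () -> "https://example/presigned",
                new byte[0], "f.bdec", 3);
        System.out.println(ok + " after " + calls[0] + " attempts"); // prints "true after 2 attempts"
    }
}
```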

@sfc-gh-gdoci (Collaborator) left a comment

Thanks, the fix LGTM. A test to reproduce the original issue would be good to have.

Re serializeFromParquetWriteBuffers: Good that you bring it up, I'll remove it because we don't plan to enable it.

@sfc-gh-azagrebin (Contributor) left a comment

Maybe I am missing something, but I think a test is missing for this change. Otherwise LGTM.

It should at least be a write/read test in the SDK, because there is no way to test it before release with our server-side tests.
It could be something like this:
create test buffer -> create flusher from it -> flush to byte[] -> create BdecParquetReader from the bytes -> get the metadata via fileReader.getFileMetaData().getKeyValueMetaData() (ParquetFileReader.fileReader needs to become a field so the metadata can be read from it)

@sfc-gh-hmadan (Collaborator) left a comment

Code change looks good, let's add a test before checking it in!

@sfc-gh-tzhang (Contributor, Author)

> Maybe I am missing something but I think the test is missing for this change. […]

Thanks Andrey! It took me some time, but I was able to add testParquetFileNameMetadata based on your suggestion. PTAL.
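The real test follows the flow suggested above (flush to byte[], read back with BdecParquetReader, assert on getKeyValueMetaData()), which needs the project's internal classes and parquet-hadoop. As a dependency-free illustration of the same write -> bytes -> read -> assert round trip, here is a sketch that uses java.util.Properties as a stand-in for the Parquet footer's key/value metadata; the key value is an assumption:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

public class MetadataRoundTripSketch {
    static final String PRIMARY_FILE_ID_KEY = "primaryFileId"; // assumed value

    // "Flush": serialize the key/value metadata to bytes (stand-in for the footer).
    static byte[] flush(String fileName) {
        try {
            Properties metadata = new Properties();
            metadata.setProperty(PRIMARY_FILE_ID_KEY, fileName);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            metadata.store(out, null);
            return out.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // in-memory streams should never throw
        }
    }

    // "Read back": parse the bytes and pull the key/value metadata out again.
    static String readPrimaryFileId(byte[] bytes) {
        try {
            Properties metadata = new Properties();
            metadata.load(new ByteArrayInputStream(bytes));
            return metadata.getProperty(PRIMARY_FILE_ID_KEY);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = flush("abc_123.bdec");
        System.out.println(readPrimaryFileId(bytes)); // prints "abc_123.bdec"
    }
}
```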

pom.xml Outdated
Comment on lines 355 to 360
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>${hadoop.version}</version>
<scope>test</scope>
</dependency>
sfc-gh-tzhang (Contributor, Author)

This dependency is needed for ParquetFileReader to work.

@sfc-gh-tzhang sfc-gh-tzhang merged commit 3734061 into master Sep 10, 2024
47 checks passed
@sfc-gh-tzhang sfc-gh-tzhang deleted the tzhang-si-corruption branch September 10, 2024 18:44