-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SNOW-1675591 Fill in ExternalVolume and ExternalVolumeManager to do presigned url retrieval + blobname population #837
Conversation
sfc-gh-hmadan
commented
Sep 20, 2024
- Add a GeneratePresignedUrls API call (new contracts + snowflakeClient API)
- Enhance ExternalVolume.java to do presigned url file uploads to S3 (GCP and Azure testing pending thus commented out)
- fix json field names where applicable
- Add @JsonIgnoreProperties(ignoreUnknown = true) for forward compatiblity with service side evolution to some contracts that were missing this tag
fix unit tests change non_default to non_null
0f005ae
to
ceee1a2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly good. Just some minor comments.
src/main/java/net/snowflake/ingest/streaming/internal/ChunkMetadata.java
Show resolved
Hide resolved
src/main/java/net/snowflake/ingest/streaming/internal/ExternalVolume.java
Outdated
Show resolved
Hide resolved
src/main/java/net/snowflake/ingest/streaming/internal/ExternalVolume.java
Outdated
Show resolved
Hide resolved
src/main/java/net/snowflake/ingest/streaming/internal/ExternalVolume.java
Show resolved
Hide resolved
src/main/java/net/snowflake/ingest/streaming/internal/ExternalVolume.java
Outdated
Show resolved
Hide resolved
src/main/java/net/snowflake/ingest/streaming/internal/ExternalVolume.java
Show resolved
Hide resolved
src/main/java/net/snowflake/ingest/streaming/internal/ExternalVolume.java
Show resolved
Hide resolved
} | ||
} | ||
if (generate) { | ||
// TODO: do this generation on a background thread to allow the current thread to make |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like the background thread approach would also be simpler - you won't need all the Semaphore business. The bg thread could just keep refreshing in a loop whenever it sees the remaining urls hit a low.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, couple reasons I kept this as simple as possible:
-
There can be many many external volumes (one per table), and the right fix there would be to have a small threadpool that multiplexes over all the active external volumes, and potentially even does a batched retrieval instead of one RPC per ExtVol. This is doable, just a matter of work ordering
-
rate matching - some ext vols will need 1 URL every minute, other volumes will need 10 URLs every second; to do a proper fix the low watermark needs to be adaptive instead of the hardcoded 5 right now.
-
Multipart upload support - to do multipart upload, each part requires its' own presigned URL AND needs to be associated with an upload-id; however now we can't cache URLs in advance as we won't know (in advance) how many parts any given file will have and don't have a batch of upload-ids either to quickly choose from.
Of these three, ended up just having a roundabout fix for (2) by allowing concurrent URL retrieval, as soon as there's high pressure on URL retrieval we'll end up caching a lot of URLs and reduce pressure.
- fix the test failures due to new json fields that the already-deployed snowflake service does not expect to see (it is not forward compatible). minor cleanup along the way.
00d842f
to
126799e
Compare