SNOW-1675591 Fill in ExternalVolume and ExternalVolumeManager to do presigned url retrieval + blobname population #837

Merged: 3 commits merged into master from hmadan-iceberg-sep19 on Sep 25, 2024

Conversation

sfc-gh-hmadan (Collaborator)

  1. Add a GeneratePresignedUrls API call (new contracts + snowflakeClient API)
  2. Enhance ExternalVolume.java to do presigned URL file uploads to S3 (GCP and Azure testing is pending, so that code is commented out); a minimal upload sketch follows this list
  3. Fix JSON field names where applicable
  4. Add @JsonIgnoreProperties(ignoreUnknown = true) for forward compatibility with service-side evolution to some contracts that were missing this tag (a contract sketch also follows this list)

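A minimal sketch of the presigned-URL upload path from item 2, assuming the URL already grants PUT access to the target blob; the class and method names are illustrative, not the actual ExternalVolume code:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

final class PresignedUrlUploader {
  // Uploads a serialized blob with a plain HTTP PUT against the presigned URL,
  // so no cloud-provider SDK credentials are needed on the client.
  static void upload(String presignedUrl, byte[] blob) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(presignedUrl).openConnection();
    conn.setDoOutput(true);
    conn.setRequestMethod("PUT");
    conn.setFixedLengthStreamingMode(blob.length);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(blob);
    }
    int status = conn.getResponseCode();
    if (status < 200 || status >= 300) {
      throw new IOException("Presigned URL upload failed with HTTP " + status);
    }
    conn.disconnect();
  }
}
```

And a sketch of the contract hardening from items 1, 3, and 4: Jackson's @JsonIgnoreProperties(ignoreUnknown = true) lets the client skip fields a newer service adds, while JsonInclude.Include.NON_NULL keeps unset fields out of the serialized payload. The class and field names here are hypothetical, not the PR's actual contracts:

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.annotation.JsonProperty;
import java.util.List;

// Unknown fields added later by the service are skipped instead of failing
// deserialization, which is the forward-compatibility point in item 4.
@JsonIgnoreProperties(ignoreUnknown = true)
// Serialize only non-null fields so older service versions never see fields
// the client did not populate.
@JsonInclude(JsonInclude.Include.NON_NULL)
class GeneratePresignedUrlsResponse {
  @JsonProperty("status_code")
  Long statusCode;

  @JsonProperty("message")
  String message;

  @JsonProperty("presigned_url_infos")
  List<PresignedUrlInfo> presignedUrlInfos;

  @JsonIgnoreProperties(ignoreUnknown = true)
  static class PresignedUrlInfo {
    @JsonProperty("file_name")
    String fileName;

    @JsonProperty("url")
    String url;
  }
}
```
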
fix unit tests

change non_default to non_null
@sfc-gh-psaha (Contributor) left a comment

Looks mostly good. Just some minor comments.

On this excerpt from the diff:

```java
}
}
if (generate) {
// TODO: do this generation on a background thread to allow the current thread to make
```
sfc-gh-psaha (Contributor)

I feel like the background thread approach would also be simpler: you won't need all the Semaphore business. The background thread could just keep refreshing in a loop whenever it sees the remaining URLs drop below a low watermark (a sketch of this follows below).

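A minimal sketch of the background-refresh approach suggested here, not the code in this PR; the queue type, poll interval, watermark, and batch size are all assumptions:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class PresignedUrlRefresher implements Runnable {
  private final BlockingQueue<String> cachedUrls = new LinkedBlockingQueue<>();
  private final int lowWatermark = 5; // refill when fewer URLs than this remain
  private final int batchSize = 20;   // hypothetical refill size
  private volatile boolean running = true;

  @Override
  public void run() {
    while (running) {
      try {
        if (cachedUrls.size() < lowWatermark) {
          // One blocking RPC per refill; callers never pay this latency directly.
          cachedUrls.addAll(generatePresignedUrls(batchSize));
        }
        Thread.sleep(100); // simple poll interval
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  // Callers just take from the cache and block if the refresher has fallen behind.
  String takeUrl() throws InterruptedException {
    return cachedUrls.take();
  }

  private List<String> generatePresignedUrls(int count) {
    // Placeholder for the GeneratePresignedUrls RPC added in this PR.
    return Collections.emptyList();
  }

  void stop() {
    running = false;
  }
}
```
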
sfc-gh-hmadan (Collaborator, Author)

Yes, a couple of reasons I kept this as simple as possible:

  1. There can be many external volumes (one per table). The right fix there would be a small thread pool that multiplexes over all the active external volumes, and potentially even does a batched retrieval instead of one RPC per external volume. This is doable, it's just a matter of work ordering.

  2. Rate matching: some external volumes will need 1 URL every minute, others will need 10 URLs every second. A proper fix needs the low watermark to be adaptive instead of the hardcoded 5 it is right now.

  3. Multipart upload support: for multipart upload, each part requires its own presigned URL and needs to be associated with an upload-id. In that case we can't cache URLs in advance, because we won't know (in advance) how many parts any given file will have, and we don't have a batch of upload-ids to quickly choose from either.

Of these three, I ended up with just a roundabout fix for (2) by allowing concurrent URL retrieval: as soon as there's high pressure on URL retrieval we end up caching a lot of URLs, which reduces the pressure (sketched below).
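
A minimal sketch of the roundabout fix described above; the names, refill size, and concurrency limit are assumptions rather than the actual ExternalVolume code:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;

class OnDemandUrlCache {
  private static final int LOW_WATERMARK = 5;          // the hardcoded watermark mentioned above
  private static final int MAX_CONCURRENT_FETCHES = 4; // hypothetical bound

  private final ConcurrentLinkedQueue<String> cachedUrls = new ConcurrentLinkedQueue<>();
  private final Semaphore fetchPermits = new Semaphore(MAX_CONCURRENT_FETCHES);

  String getPresignedUrl() {
    // Refill whenever the cache runs low; under high pressure multiple callers
    // hold permits and fetch concurrently, so the cached supply catches up.
    if (cachedUrls.size() < LOW_WATERMARK && fetchPermits.tryAcquire()) {
      try {
        cachedUrls.addAll(generatePresignedUrls(20));
      } finally {
        fetchPermits.release();
      }
    }
    String url = cachedUrls.poll();
    if (url == null) {
      // Cache still empty: fall back to a direct single-URL fetch.
      List<String> fresh = generatePresignedUrls(1);
      url = fresh.isEmpty() ? null : fresh.get(0);
    }
    return url;
  }

  private List<String> generatePresignedUrls(int count) {
    // Placeholder for the GeneratePresignedUrls RPC added in this PR.
    return Collections.emptyList();
  }
}
```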

- Fix the test failures due to new JSON fields that the already-deployed Snowflake service does not expect to see (it is not forward compatible). Minor cleanup along the way.
@sfc-gh-hmadan sfc-gh-hmadan merged commit 7b3881b into master Sep 25, 2024
45 checks passed
@sfc-gh-hmadan sfc-gh-hmadan deleted the hmadan-iceberg-sep19 branch September 25, 2024 20:48