Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement and test S3 Sink, with usage in runtime. #278

Merged
merged 8 commits into from
Jul 20, 2024
Merged

Implement and test S3 Sink, with usage in runtime. #278

merged 8 commits into from
Jul 20, 2024

Conversation

aliddell
Copy link
Member

No description provided.

@aliddell aliddell changed the base branch from 238-pt-5-2 to main July 17, 2024 17:53
@aliddell aliddell linked an issue Jul 18, 2024 that may be closed by this pull request
Copy link
Collaborator

@shlomnissan shlomnissan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t have clear blockers, but I think a round of cleanup and clarifications will be good before merging. There are several concerns that I have in general:

  • I'm worried that the error handling patterns in this feature are unclear. Using catch-all for exceptions and the practice of throwing for catching and logging is especially concerning. Error handling for URIs also feels incomplete.
  • I think we're using noexcept inappropriately (primarily to convey intent), which could cause issues over time.
  • I might be missing some context, but the use of certain containers (std::list and std::queue) doesn’t make a lot of sense to me. These are not trivial decisions because they typically convey intent and come with overhead.

Anyway, I think it’s almost there.

CHANGELOG.md Show resolved Hide resolved
src/common/utilities.cpp Outdated Show resolved Hide resolved
src/common/utilities.hh Show resolved Hide resolved
src/writers/file.sink.cpp Show resolved Hide resolved
src/writers/zarrv3.writer.cpp Show resolved Hide resolved
src/writers/sink.creator.cpp Outdated Show resolved Hide resolved
src/writers/sink.creator.cpp Outdated Show resolved Hide resolved
src/writers/sink.creator.cpp Outdated Show resolved Hide resolved
Comment on lines 347 to 348
const auto object_key = object_keys.front();
object_keys.pop();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Why do we need to pop all the objects? If it's not absolutely necessary we should use a vector.
  • What happens if an object fails (an iteration throws an exception)? If we pop it from the queue, it's lost no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I use the queue to construct paths, as we discussed in a previous PR. I'd have to rework that whole thing. I really don't think that a vector is the right tool for constructing paths like /(1...k)/(1...m)/(1...n)/.../(1...z).
  • I've been thinking that since we're not actually doing anything resource intensive, like opening a file, when we create an S3Sink, we can do this serially.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I’m missing something here. I’ll tinker with the code to better understand this requirement.

src/writers/sink.creator.cpp Outdated Show resolved Hide resolved
std::shared_ptr<common::S3ConnectionPool> connection_pool_;

// multipart upload
std::array<uint8_t, 5 << 20> part_buffer_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would restore the comment or create a variable (static constexpr size_t five_mb_size = 5 << 20) so it's obvious what this number represents.

Comment on lines 112 to 118
auto get_upload_id = [this, &connection]() -> std::string {
if (upload_id_.empty()) {
upload_id_ =
connection->create_multipart_object(bucket_name_, object_key_);
}
return upload_id_;
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a separate method. Sorry if my comment was confusing—I typically refer to accessor methods as computed properties.

}

bool
zarr::S3Sink::write(size_t offset, const uint8_t* data, size_t bytes_of_data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A common pattern for identifying unused parameters is using underscores.

zarr::S3Sink::write(size_t _, const uint8_t* data, size_t bytes_of_data)

Comment on lines 347 to 348
const auto object_key = object_keys.front();
object_keys.pop();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I’m missing something here. I’ll tinker with the code to better understand this requirement.

Copy link
Collaborator

@shlomnissan shlomnissan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor comments, otherwise, LGTM 👍

@aliddell aliddell merged commit 597eed5 into main Jul 20, 2024
3 checks passed
@aliddell aliddell deleted the 238-pt-6 branch July 21, 2024 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement S3 Sinks
2 participants