Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41973: Expose new S3 option check_directory_existence_before_creation #41998

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

HaochengLIU
Copy link
Contributor

Rationale for this change

Expose new S3 option check_directory_existence_before_creation from GH-41493
PR to expose in Python is also under review.

What changes are included in this PR?

Expose new S3 option check_directory_existence_before_creation from GH-41493

Are these changes tested?

yes

Are there any user-facing changes?

Yes. R function documentation is updated.

Copy link

github-actions bot commented Jun 6, 2024

⚠️ GitHub issue #41973 has been automatically assigned in GitHub to PR creator.

@HaochengLIU
Copy link
Contributor Author

Hi @nealrichardson , could you pls review when you have time? I have to admit I use C++/Python in my daily life but know nothing about R and just follow the existing code for allow_bucket_deletion to expose it.

@HaochengLIU
Copy link
Contributor Author

Hi @paleolimbot @thisisnic , could you review this PR when you have time? Thanks.

@amoeba
Copy link
Member

amoeba commented Aug 1, 2024

Hi @HaochengLIU, I took a look through and left some suggestions.

One thing I think we'll have to sort out here is the issue that's making the Check minimum supported Arrow C++ Version check fail. We'll need an ifdef guard in r/src/filesystem.cpp but I'm not sure if we've got a pattern for what the user experiences.

Edit: And can you please rebase and fix conflicts?

@HaochengLIU HaochengLIU force-pushed the check_directory_existence_before_creation-expose-R branch from 7f3d26e to e46e0dc Compare August 2, 2024 00:23
@HaochengLIU HaochengLIU requested a review from jonkeane as a code owner August 2, 2024 00:23
@HaochengLIU
Copy link
Contributor Author

Hi @amoeba , I rebased and the somehow do not see the suggestion you are referring to.. Could you point me to that?
Also could you guide what do you mean by "ifdef guard" in the filesystem.cpp? It's not using any standard or platform dependent code, simply a new argument

Per the install error

* checking whether package ‘arrow’ can be installed ... [87s/90s] ERROR
Installation failed.
See ‘/home/runner/work/arrow/arrow/src/r/check/arrow.Rcheck/00install.out’ for details.

Where is this 00install.out log 😢
ty.

@thisisnic
Copy link
Member

thisisnic commented Aug 4, 2024

@HaochengLIU The CI output there can be a bit tricky to follow, but the important bit is in the next section: https://github.com/apache/arrow/actions/runs/10207786938/job/28243188693?pr=41998#step:8:49

 filesystem.cpp: In function ‘std::shared_ptr<arrow::fs::S3FileSystem> fs___S3FileSystem__create(bool, std::string, std::string, std::string, std::string, std::string, std::string, int, std::string, std::string, std::string, std::string, bool, bool, bool, bool, double, double)’:
filesystem.cpp:334:11: error: ‘struct arrow::fs::S3Options’ has no member named ‘check_directory_existence_before_creation’
  334 |   s3_opts.check_directory_existence_before_creation =
      |           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
make: *** [/usr/lib/R/etc/Makeconf:204: filesystem.o] Error 1
ERROR: compilation failed for package ‘arrow’

I think what @amoeba is referring to is that fact that the arrow R package is compatible with previous versions of the Arrow C++ library, and so you're getting that error above because #41822 was only released in version 17.0.0 so you'll get errors trying to use it with previous versions, as our minimum version to support is 15.0.2.

@@ -55,6 +56,7 @@ limited_fs <- S3FileSystem$create(
endpoint_override = paste0("localhost:", minio_port),
allow_bucket_creation = FALSE,
allow_bucket_deletion = FALSE
check_directory_existence_before_creation = false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
check_directory_existence_before_creation = false,
check_directory_existence_before_creation = FALSE,

r/NEWS.md Outdated
@@ -18,6 +18,7 @@
-->

# arrow 16.1.0.9000
* Expose an option `check_directory_existence_before_creation` in `S3FileSystem` which defaults to false. If it's set to false, when creating a directory the code will not check if it already exists or not. It's an optimization to try directory creation and catch the error, rather than issue two dependent I/O calls. If true, when creating a directory the code will only create the directory when necessary at the cost of extra I/O calls. This can be used for key/value cloud storage which has a hard rate limit to number of object mutation operations or scenerios such as the directories already exist and you do not have creation access.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Expose an option `check_directory_existence_before_creation` in `S3FileSystem` which defaults to false. If it's set to false, when creating a directory the code will not check if it already exists or not. It's an optimization to try directory creation and catch the error, rather than issue two dependent I/O calls. If true, when creating a directory the code will only create the directory when necessary at the cost of extra I/O calls. This can be used for key/value cloud storage which has a hard rate limit to number of object mutation operations or scenerios such as the directories already exist and you do not have creation access.
* Expose an option `check_directory_existence_before_creation` in `S3FileSystem` which defaults to `FALSE`. If it's set to false, when creating a directory the code will not check if it already exists or not. It's an optimization to try directory creation and catch the error, rather than issue two dependent I/O calls. If set to `TRUE`, when creating a directory the code will only create the directory when necessary at the cost of extra I/O calls. This can be used for key/value cloud storage which has a hard rate limit to number of object mutation operations or scenarios such as the directories already exist and you do not have creation access.

r/R/filesystem.R Outdated
@@ -156,6 +156,13 @@ FileSelector$create <- function(base_dir, allow_not_found = FALSE, recursive = F
#' buckets if `$CreateDir()` is called on the bucket level (default `FALSE`).
#' - `allow_bucket_deletion`: logical, if TRUE, the filesystem will delete
#' buckets if`$DeleteDir()` is called on the bucket level (default `FALSE`).
#' - `check_directory_existence_before_creation`: logical, if FALSE, when creating a directory the code will
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#' - `check_directory_existence_before_creation`: logical, if FALSE, when creating a directory the code will
#' - `check_directory_existence_before_creation`: logical, if `FALSE`, when creating a directory the code will

r/R/filesystem.R Outdated
#' - `check_directory_existence_before_creation`: logical, if FALSE, when creating a directory the code will
#' . not check if it already exists or not. It's an optimization to try directory creation and catch the error,
#' rather than issue two dependent I/O calls.
#' if TRUE, when creating a directory the code will only create the directory when necessary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#' if TRUE, when creating a directory the code will only create the directory when necessary
#' if `TRUE`, when creating a directory the code will only create the directory when necessary

r/R/filesystem.R Outdated
#' rather than issue two dependent I/O calls.
#' if TRUE, when creating a directory the code will only create the directory when necessary
#' at the cost of extra I/O calls. This can be used for key/value cloud storage which has
#' a hard rate limit to number of object mutation operations or scenerios such as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#' a hard rate limit to number of object mutation operations or scenerios such as
#' a hard rate limit to number of object mutation operations or scenarios such as

@@ -55,6 +56,7 @@ limited_fs <- S3FileSystem$create(
endpoint_override = paste0("localhost:", minio_port),
allow_bucket_creation = FALSE,
allow_bucket_deletion = FALSE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
allow_bucket_deletion = FALSE
allow_bucket_deletion = FALSE,

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 6, 2024
@amoeba
Copy link
Member

amoeba commented Aug 6, 2024

Sorry @HaochengLIU, I think I forgot to save my review. You should see them now.

@thisisnic is right about my note. See this example:

#if ARROW_VERSION_MAJOR >= 16

I think here we want to do here is (1) conditionally define two versions of fs___S3FileSystem__create, one that supports check_directory_existence_before_creation and one that doesn't and then (2) catch calls to S3FileSystem$create which pass the option check_directory_existence_before_creation when arrow::arrow_info()$build_info$cpp_version < "17.0.0" and probably emit a warning but otherwise continue. Let me know if this isn't clear.

@nealrichardson
Copy link
Member

Sorry @HaochengLIU, I think I forgot to save my review. You should see them now.

@thisisnic is right about my note. See this example:

#if ARROW_VERSION_MAJOR >= 16

I think here we want to do here is (1) conditionally define two versions of fs___S3FileSystem__create, one that supports check_directory_existence_before_creation and one that doesn't and then (2) catch calls to S3FileSystem$create which pass the option check_directory_existence_before_creation when arrow::arrow_info()$build_info$cpp_version < "17.0.0" and probably emit a warning but otherwise continue. Let me know if this isn't clear.

The tricky thing about that is the codegen script that creates arrowExports.cpp: it doesn't know how to handle the conditional definitions like that, and I'm not sure it's worth extending to do so.

Another option would be to refactor to send a list of options to fs___S3FileSystem__create (and the other FileSystem create methods, for consistency) and then pull the settings out of the list inside the C++ function, like how make_compute_options() works (https://github.com/apache/arrow/blob/main/r/src/compute.cpp#L129). Then the #if ARROW_VERSION_MAJOR >= 16 would just be wrapping a few lines inside that function and not affecting the signature.

@HaochengLIU HaochengLIU force-pushed the check_directory_existence_before_creation-expose-R branch from e46e0dc to 913400d Compare August 8, 2024 16:24
@HaochengLIU
Copy link
Contributor Author

HaochengLIU commented Aug 8, 2024

Hi folks, thanks for reviewing and I've addressed most of the comments.
Referring to the non-backward compatible check_directory_existence_before_creation flag, IMHO refactoring fs___S3FileSystem__create (and the other FileSystem create methods, for consistency) should be a seperate PR.

If it's confirmed that arrow/r/src/extension-impl.cpp will not work per @nealrichardson , I can try to tackle the refactor work first once I'm back from vacation

@jonkeane
Copy link
Member

jonkeane commented Aug 8, 2024

We probably shouldn't do this here, but we should have a discussion about how much effort keeping compatibility with older versions we should push for PRs like these. I've created #43623 for that, but broadly am supportive of bumping the minimum version if we need to in order to implement this feature.

@amoeba
Copy link
Member

amoeba commented Aug 9, 2024

Thanks for the catch @nealrichardson and the suggestion. I'd be supportive of approving this PR with just the "Check minimum supported Arrow C++ Version " job failing and just my existing suggestions applied.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants