-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support VFS ls_recursive API for posix filesystem. #4778
Conversation
This pull request has been linked to Shortcut Story #42004: Implement ls_recursive support for LocalFs. |
I gotta set up We also discussed adding tests to test/src/unit-cppapi-vfs.cc which I just noticed I forgot to do before creating this PR. |
tiledb/api/c_api/vfs/vfs_api.cc
Outdated
vfs->ls_recursive(tiledb::sm::URI(path), callback, data); | ||
|
||
tiledb::sm::LsObjects lsresults; | ||
throw_if_not_ok(vfs->ls_recursive(tiledb::sm::URI(path), &lsresults)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the reviewers:
This change was one of the ways to get the empty subdirectories to appear in the external API output (they appear just fine in the internal API output).
Previously the callback supplied here found its way down the call stack and was being used as the file_filter
argument to the internal API. We wouldn't see empty directories (or any directories) for that, of course, since the the file_filter
is not invoked on them. The return value of the internal ls_recursive
was dropped on the floor and the regular file results were duplicated in the callback state.
I assume the intent of putting the callback in this way was so that the C API callback could be interleaved with the ls
expansion? IMO if we want to continue enabling that the best way would be to propagate the separate file/directory filter callbacks all the way up to the C API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we haven't added support for pruning directories with the directory_filter yet. Internally as I'm sure you've noticed this argument defaults to tiledb::sm::accept_all_dirs
so all directories are always accepted, but I see what you are saying. The results are included in the value returned from ls_recursive but not the data managed by the callback. This isn't the case for S3 since there is no concept of directories on S3, we apply the callback to all objects.
I think for now the best way forward is to call the one user supplied callback for directories and files, just pass in a 0
for the object size here. Adding support for the directory_filter could be done separately IMO.
TileDB/tiledb/sm/filesystem/posix.h
Line 389 in 22b2e3b
if (directory_filter(absuri)) { |
If we do this, ls_recursive("root/")
on a test tree with root/subdir/
and root/subdir/file.txt
, both of these will be included in the results. This is what I would expect on posix but worth double-checking with @KiterLuc or @eric-hughes-tiledb on this. If we want to filter out the non-empty directories we could do that with std::filesystem::is_empty
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One callback makes sense, that will entail more substantial changes, which seems fine.
Seems like in that scenario we should change the callback to accept something like a tiledb::common::directory_entry
so the user doesn't have to do additional system calls if they want to ask.
As far as output is concerned - mimicking ls -R
output probably makes sense, right? That includes root/subdir/
and root/subdir/file.txt
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the internal API to have a single PathPredicate P
and made the common case for that be PathFilter = std::function<PathFilterResult(const directory_entry_uri_view &)>
which unifies a bunch of scenarios and adds a lot of flexibility as to how the iteration proceeds. This is a pretty sizable change but it should be covered pretty well by the unit tests.
This also means that decisions about whether to filter empty directories, etc can be pushed closer to the external API rather than embedding them in the directory iteration function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussion this morning: we will revert several changes here in favor of landing a minimal change set in the imminent release. The changes here will be re-homed to another branch where we can make more substantive internal and external API changes in the future.
For this comment in particular we're going to resolve it by calling file_filter
as well on directories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good so far 👍 I made a few comments but mostly should be easy to address. While reviewing I applied a few of my suggestions to make sure everything made sense. I hadn't quite sorted everything out but branch is here if it's helpful
https://github.com/TileDB-Inc/TileDB/tree/smr/rr/sc-42004-ls_recursive-posix
tiledb/api/c_api/vfs/vfs_api.cc
Outdated
vfs->ls_recursive(tiledb::sm::URI(path), callback, data); | ||
|
||
tiledb::sm::LsObjects lsresults; | ||
throw_if_not_ok(vfs->ls_recursive(tiledb::sm::URI(path), &lsresults)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we haven't added support for pruning directories with the directory_filter yet. Internally as I'm sure you've noticed this argument defaults to tiledb::sm::accept_all_dirs
so all directories are always accepted, but I see what you are saying. The results are included in the value returned from ls_recursive but not the data managed by the callback. This isn't the case for S3 since there is no concept of directories on S3, we apply the callback to all objects.
I think for now the best way forward is to call the one user supplied callback for directories and files, just pass in a 0
for the object size here. Adding support for the directory_filter could be done separately IMO.
TileDB/tiledb/sm/filesystem/posix.h
Line 389 in 22b2e3b
if (directory_filter(absuri)) { |
If we do this, ls_recursive("root/")
on a test tree with root/subdir/
and root/subdir/file.txt
, both of these will be included in the results. This is what I would expect on posix but worth double-checking with @KiterLuc or @eric-hughes-tiledb on this. If we want to filter out the non-empty directories we could do that with std::filesystem::is_empty
.
a80f644
to
4721c28
Compare
@@ -348,7 +348,7 @@ TEST_CASE( | |||
bool empty_tile = test == "empty tile"; | |||
|
|||
uint64_t max_string_size = 100; | |||
uint64_t num_strings = 2000; | |||
int num_strings = 2000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't able to reproduce it or figure out why but this line of code started producing warnings in the windows build which failed the CI checks.
C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30133\include\xutility(135,1): warning C4267: 'initializing': conversion from 'size_t' to '_Ty', possible loss of data [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
with
[
_Ty=int
] (compiling source file D:\a\TileDB\TileDB\test\src\unit-tile-metadata.cc)
The operation rand() % num_strings
produces a uint64_t
which is truncated to int
when it is pushed onto values
. Since num_strings
is 2000 this probably shouldn't have produced a warning with any compiler, but it was doing so in the msvc
job.
dbfe4c5
to
5a305fb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍 Some minor nits but nothing major.
tiledb::sm::LsObjects out_objs(in_objs); | ||
std::sort( | ||
out_objs.begin(), out_objs.end(), [](const auto& f1, const auto& f2) { | ||
return f1.first < f2.first; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this default behavior for std::sort on vector<pair<string, uint64_t>>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably is.
tiledb::sm::accept_all_files, | ||
tiledb::sm::accept_all_dirs)); | ||
|
||
CHECK(ls.size() == 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny nit, but can we use REQUIRE
like you did above to avoid the repeated if
checks in the rest of these tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose to do it this way because if the REQUIRE
does not pass, then there is no output whatsoever about what the contents of ls
actually are. This way the code is a little uglier but when it fails the first time there is still useful output. I would prefer not to change this.
} | ||
|
||
TEST_CASE("VFS: ls_recursive unfiltered", "[vfs][ls_recursive]") { | ||
std::string prefix = GENERATE("file://"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to run all these tests on S3 too, but I'm ok with doing that as follow up if we want to tackle that separately.
LsObjects std_filesystem_ls_filtered( | ||
const URI& parent, F file_filter, D directory_filter, bool recursive) { | ||
/* | ||
* The input URI was useful to the top-level VFS to identify this is a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny nit but I think the convention is to use single line comments unless we're writing doxygen
testpath.touch(true); | ||
} | ||
|
||
SECTION("Directory predicate returning true is filtered from results") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean false
here? Same comment for the SECTION below.
@shaunrd0 previously added
ls_recursive
support for the S3 file system. This pull request implements it for Posix, in a hopefully-portable way (std::filesystem::recursive_directory_iterator
) which can be recycled for Windows.Some notes about input/output:
LsObjects
has astd::string
component, the string is a URI.The recursion is implemented so that if a directory does not pass the directory filter, then we do not descend into that directory. I have added a unit test to check this behavior. Something like this could be useful for S3 as well if the "list objects" request is done in hierarchical mode.
TYPE: FEATURE
DESC: Support VFS
ls_recursive
API for posix filesystem.