Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support VFS ls_recursive API for posix filesystem. #4778

Merged
merged 2 commits into from
Mar 11, 2024

Conversation

rroelke
Copy link
Contributor

@rroelke rroelke commented Mar 1, 2024

@shaunrd0 previously added ls_recursive support for the S3 file system. This pull request implements it for Posix, in a hopefully-portable way (std::filesystem::recursive_directory_iterator) which can be recycled for Windows.

Some notes about input/output:

  • while LsObjects has a std::string component, the string is a URI.
  • for posix, we include empty directories in the result set.

The recursion is implemented so that if a directory does not pass the directory filter, then we do not descend into that directory. I have added a unit test to check this behavior. Something like this could be useful for S3 as well if the "list objects" request is done in hierarchical mode.


TYPE: FEATURE
DESC: Support VFS ls_recursive API for posix filesystem.

@rroelke rroelke requested a review from shaunrd0 March 1, 2024 19:16
Copy link

This pull request has been linked to Shortcut Story #42004: Implement ls_recursive support for LocalFs.

@rroelke
Copy link
Contributor Author

rroelke commented Mar 1, 2024

I gotta set up clang-format to run automatically, looks like I missed a file :/

We also discussed adding tests to test/src/unit-cppapi-vfs.cc which I just noticed I forgot to do before creating this PR.

vfs->ls_recursive(tiledb::sm::URI(path), callback, data);

tiledb::sm::LsObjects lsresults;
throw_if_not_ok(vfs->ls_recursive(tiledb::sm::URI(path), &lsresults));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the reviewers:

This change was one of the ways to get the empty subdirectories to appear in the external API output (they appear just fine in the internal API output).

Previously the callback supplied here found its way down the call stack and was being used as the file_filter argument to the internal API. We wouldn't see empty directories (or any directories) for that, of course, since the the file_filter is not invoked on them. The return value of the internal ls_recursive was dropped on the floor and the regular file results were duplicated in the callback state.

I assume the intent of putting the callback in this way was so that the C API callback could be interleaved with the ls expansion? IMO if we want to continue enabling that the best way would be to propagate the separate file/directory filter callbacks all the way up to the C API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we haven't added support for pruning directories with the directory_filter yet. Internally as I'm sure you've noticed this argument defaults to tiledb::sm::accept_all_dirs so all directories are always accepted, but I see what you are saying. The results are included in the value returned from ls_recursive but not the data managed by the callback. This isn't the case for S3 since there is no concept of directories on S3, we apply the callback to all objects.

I think for now the best way forward is to call the one user supplied callback for directories and files, just pass in a 0 for the object size here. Adding support for the directory_filter could be done separately IMO.

if (directory_filter(absuri)) {

If we do this, ls_recursive("root/") on a test tree with root/subdir/ and root/subdir/file.txt, both of these will be included in the results. This is what I would expect on posix but worth double-checking with @KiterLuc or @eric-hughes-tiledb on this. If we want to filter out the non-empty directories we could do that with std::filesystem::is_empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One callback makes sense, that will entail more substantial changes, which seems fine.

Seems like in that scenario we should change the callback to accept something like a tiledb::common::directory_entry so the user doesn't have to do additional system calls if they want to ask.

As far as output is concerned - mimicking ls -R output probably makes sense, right? That includes root/subdir/ and root/subdir/file.txt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the internal API to have a single PathPredicate P and made the common case for that be PathFilter = std::function<PathFilterResult(const directory_entry_uri_view &)> which unifies a bunch of scenarios and adds a lot of flexibility as to how the iteration proceeds. This is a pretty sizable change but it should be covered pretty well by the unit tests.

This also means that decisions about whether to filter empty directories, etc can be pushed closer to the external API rather than embedding them in the directory iteration function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion this morning: we will revert several changes here in favor of landing a minimal change set in the imminent release. The changes here will be re-homed to another branch where we can make more substantive internal and external API changes in the future.

For this comment in particular we're going to resolve it by calling file_filter as well on directories.

Copy link
Contributor

@shaunrd0 shaunrd0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good so far 👍 I made a few comments but mostly should be easy to address. While reviewing I applied a few of my suggestions to make sure everything made sense. I hadn't quite sorted everything out but branch is here if it's helpful

https://github.com/TileDB-Inc/TileDB/tree/smr/rr/sc-42004-ls_recursive-posix

tiledb/api/c_api/vfs/vfs_api_experimental.h Outdated Show resolved Hide resolved
tiledb/api/c_api/vfs/vfs_api_internal.h Outdated Show resolved Hide resolved
tiledb/sm/filesystem/posix.h Outdated Show resolved Hide resolved
test/src/unit-cppapi-vfs.cc Outdated Show resolved Hide resolved
tiledb/sm/filesystem/posix.h Outdated Show resolved Hide resolved
tiledb/sm/cpp_api/vfs.h Outdated Show resolved Hide resolved
vfs->ls_recursive(tiledb::sm::URI(path), callback, data);

tiledb::sm::LsObjects lsresults;
throw_if_not_ok(vfs->ls_recursive(tiledb::sm::URI(path), &lsresults));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we haven't added support for pruning directories with the directory_filter yet. Internally as I'm sure you've noticed this argument defaults to tiledb::sm::accept_all_dirs so all directories are always accepted, but I see what you are saying. The results are included in the value returned from ls_recursive but not the data managed by the callback. This isn't the case for S3 since there is no concept of directories on S3, we apply the callback to all objects.

I think for now the best way forward is to call the one user supplied callback for directories and files, just pass in a 0 for the object size here. Adding support for the directory_filter could be done separately IMO.

if (directory_filter(absuri)) {

If we do this, ls_recursive("root/") on a test tree with root/subdir/ and root/subdir/file.txt, both of these will be included in the results. This is what I would expect on posix but worth double-checking with @KiterLuc or @eric-hughes-tiledb on this. If we want to filter out the non-empty directories we could do that with std::filesystem::is_empty.

@rroelke rroelke force-pushed the rr/sc-42004-ls_recursive-posix branch from a80f644 to 4721c28 Compare March 7, 2024 22:16
@@ -348,7 +348,7 @@ TEST_CASE(
bool empty_tile = test == "empty tile";

uint64_t max_string_size = 100;
uint64_t num_strings = 2000;
int num_strings = 2000;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to reproduce it or figure out why but this line of code started producing warnings in the windows build which failed the CI checks.

C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30133\include\xutility(135,1): warning C4267: 'initializing': conversion from 'size_t' to '_Ty', possible loss of data [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
          with
          [
              _Ty=int
          ] (compiling source file D:\a\TileDB\TileDB\test\src\unit-tile-metadata.cc)

The operation rand() % num_strings produces a uint64_t which is truncated to int when it is pushed onto values. Since num_strings is 2000 this probably shouldn't have produced a warning with any compiler, but it was doing so in the msvc job.

@rroelke rroelke force-pushed the rr/sc-42004-ls_recursive-posix branch from dbfe4c5 to 5a305fb Compare March 8, 2024 04:03
Copy link
Contributor

@shaunrd0 shaunrd0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 Some minor nits but nothing major.

tiledb/sm/filesystem/vfs.h Outdated Show resolved Hide resolved
tiledb/sm/filesystem/vfs.h Outdated Show resolved Hide resolved
tiledb::sm::LsObjects out_objs(in_objs);
std::sort(
out_objs.begin(), out_objs.end(), [](const auto& f1, const auto& f2) {
return f1.first < f2.first;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this default behavior for std::sort on vector<pair<string, uint64_t>>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably is.

tiledb::sm::accept_all_files,
tiledb::sm::accept_all_dirs));

CHECK(ls.size() == 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit, but can we use REQUIRE like you did above to avoid the repeated if checks in the rest of these tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to do it this way because if the REQUIRE does not pass, then there is no output whatsoever about what the contents of ls actually are. This way the code is a little uglier but when it fails the first time there is still useful output. I would prefer not to change this.

}

TEST_CASE("VFS: ls_recursive unfiltered", "[vfs][ls_recursive]") {
std::string prefix = GENERATE("file://");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to run all these tests on S3 too, but I'm ok with doing that as follow up if we want to tackle that separately.

LsObjects std_filesystem_ls_filtered(
const URI& parent, F file_filter, D directory_filter, bool recursive) {
/*
* The input URI was useful to the top-level VFS to identify this is a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit but I think the convention is to use single line comments unless we're writing doxygen

testpath.touch(true);
}

SECTION("Directory predicate returning true is filtered from results") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean false here? Same comment for the SECTION below.

@KiterLuc KiterLuc changed the title Rr/sc 42004 ls recursive posix Support VFS ls_recursive API for posix filesystem. Mar 11, 2024
@KiterLuc KiterLuc merged commit a477ca6 into dev Mar 11, 2024
57 checks passed
@KiterLuc KiterLuc deleted the rr/sc-42004-ls_recursive-posix branch March 11, 2024 12:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants