You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Describe the bug
While testing the Onedata file source plugin, I encountered the following issue: directory listing triggers scandir SIX TIMES (sic!) and results in ambiguous errors (e.g., inability to list a directory due to lack of permissions in another directory). The likely culprit is the recently added _get_total_matches_count method in the PyFilesystem2FilesSource class (PR: #18059):
def _get_total_matches_count(self, fs: FS, path: str, filter: Optional[List[str]] = None) -> int:
# For some reason, using "*" as glob does not return all files and directories, only files.
# So we need to count files and directories "*/" separately.
# Also, some filesystems do not properly support directories count (like Google Cloud Storage),
# so we need to catch TypeError exceptions and fallback to 0.
files_glob_pattern = f"{path}/{filter[0] if filter else '*'}"
try:
files_count = fs.glob(files_glob_pattern).count().files
except TypeError:
files_count = 0
directory_glob_pattern = f"{files_glob_pattern}/"
try:
directories_count = fs.glob(directory_glob_pattern).count().directories
except TypeError:
directories_count = 0
return files_count + directories_count
This method is called during each directory listing:
Besides listing a specific range of files, two additional glob calls are made to count the total number of files in the directory. These glob operations trigger scandir not only on the listed directory but traverse the file tree from the root down to one level deeper than the listed directory.
Given the following directory structure:
root
dir1
dir2
dir21
file
Attempting to list dir2 without having permissions for dir1 or dir21 causes glob to return an error, failing the entire listing process. This can be partially mitigated by providing the path parameter when calling glob (the default is , which causes scanning from the root):
Listing plus glob results in three scandir calls (since glob starts from the root, the number of calls can vary based on the nesting level). Multiply this by two for the first directory listing when the /api/remote_files endpoint is called twice:
First call, probably to get the total file count in the directory (returns total_matches header) - most parameters are undefined (limit == None, offset == None).
Second call to fetch entries for the first window (paging) - offset == 0 and limit == 25.
Switching to the next window lists only the appropriate chunk (e.g., offset == 50 and limit == 25). Nevertheless, glob is still executed.
Galaxy Version and/or server at which you observed the bug
Galaxy Version: 24.2.dev0
Commit: ed117ef
To Reproduce
Steps to reproduce the behavior:
List any source plugin in Web UI and log PyFilesystem2FilesSource._list calls
Expected behavior
Reconsider the idea of counting files. For vast directories (which happen in scientific datasets) this operation may be very costly - but is performed only to know the count of elements. Of course, the pagination would have to be reworked, as it is not possible to know the number of pages beforehand. Some ideas:
Count only up to 1000 elements. For smaller (<=1000) directories, it can work as currently. Treat bigger (>1000) directories as possibly "infinite". Most directories are less than a 1000 elements, and for bigger ones, navigation using pages may be not viable anyway.
For "infinite" directories, allow jumping to an arbitrary page - if it exceeds the actual range, display the last page.
Implement alternative navigation such as "jump to" where the user specifies a lexicographic prefix.
Separate Endpoint for glob: If glob is needed only for the initial listing to determine the total file count in the directory and is not used afterward, consider implementing a separate endpoint called once before the actual listing.
Use scandir or filterdir Directly: Instead of glob, which iterates scandir from the root, consider using scandir or filterdir directly:
sum(1 for _ in h.filterdir(path, namespaces=["details"], files=filter, dirs=filter))
Unfortunately, this also lists all files in the directory. It is unclear whether counting files without listing them is feasible, especially in the generic PyFilesystem2FilesSource.
The text was updated successfully, but these errors were encountered:
The additional unpaginated request was not to retrieve the counting, it was a bug in the UI when setting up the items provider and should be fixed by #18387. This was, probably, the most impacting issue.
Thanks again for the very detailed report, it helped a lot.
Hopefully, with #18372 and #18387 this is in better shape now.
We could think of some other ways to improve performance in the dev branch but for a fix to 24.1 this seems reasonable so I will close it. Please feel free to reopen if it is still a concern.
Describe the bug
While testing the Onedata file source plugin, I encountered the following issue: directory listing triggers scandir SIX TIMES (sic!) and results in ambiguous errors (e.g., inability to list a directory due to lack of permissions in another directory). The likely culprit is the recently added _get_total_matches_count method in the PyFilesystem2FilesSource class (PR: #18059):
This method is called during each directory listing:
Besides listing a specific range of files, two additional glob calls are made to count the total number of files in the directory. These glob operations trigger scandir not only on the listed directory but traverse the file tree from the root down to one level deeper than the listed directory.
Given the following directory structure:
Attempting to list dir2 without having permissions for dir1 or dir21 causes glob to return an error, failing the entire listing process. This can be partially mitigated by providing the path parameter when calling glob (the default is , which causes scanning from the root):
files_count = fs.glob(files_glob_pattern, path=path).count().files
However, this does not solve the second problem.
Listing plus glob results in three scandir calls (since glob starts from the root, the number of calls can vary based on the nesting level). Multiply this by two for the first directory listing when the /api/remote_files endpoint is called twice:
Switching to the next window lists only the appropriate chunk (e.g., offset == 50 and limit == 25). Nevertheless, glob is still executed.
Galaxy Version and/or server at which you observed the bug
Galaxy Version: 24.2.dev0
Commit: ed117ef
To Reproduce
Steps to reproduce the behavior:
Expected behavior
sum(1 for _ in h.filterdir(path, namespaces=["details"], files=filter, dirs=filter))
Unfortunately, this also lists all files in the directory. It is unclear whether counting files without listing them is feasible, especially in the generic PyFilesystem2FilesSource.
The text was updated successfully, but these errors were encountered: