Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support listing file sources asynchronously #19256

Closed

Conversation

kysrpex
Copy link
Contributor

@kysrpex kysrpex commented Dec 5, 2024

The method list() from galaxy.files.sources.BaseFilesSource lists the directories and files within a file source. An optional keyword argument recursive (False by default) lets it recursively retrieve directories and files within a specific directory.

This operation is very cheap in terms of CPU and expensive in IO terms, be it network or filesystem IO. Depending on how the underlying system is built, it may support retrieving directories and files recursively or not. If it does not, then every time a directory is listed, it is necessary to make another request to list each subdirectory. This may end up involving hundreds of requests. Done sequentially, this can be extremely slow, especially if each one involves network access.

This PR makes the list() method asynchronous, which enables Galaxy to wait for the underlying system to complete the requests concurrently, resulting in a massive speedup. The price to pay is the extra complexity of using the async primitives.

Since this change implies that all functions in the chain up to the API endpoints and the test functions must also be made asynchronous, this PR also takes care of it.

The changes from this PR are meant to address the friction that arises when integrating Galaxy with eLabFTW. Due to how the eLabFTW API is designed, listing the file source with recursive=True requires sending one request for each experiment or resource that contains at least one attachment. Given a large enough eLabFTW instance, this easily translates to hundreds of requests, and the whole process takes an eternity when done serially. Making so many requests concurrently is still not ideal, but at least listing ~500 experiments or resources with attachments becomes bearable.

This is the second PR of a series of PRs that integrate eLabFTW with Galaxy via a file source (together they address issue #18665):

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

The method `list()` from `galaxy.files.sources.BaseFilesSource` lists the directories and files within a file source. An optional keyword argument `recursive` (`False` by default) lets it recursively retrieve directories and files within a specific directory.

This operation is very cheap in terms of CPU and expensive in IO terms, be it network or filesystem IO. Depending on how the underlying system is built, it may support retrieving directories and files recursively or not. If it does not, then every time a directory is listed, it is necessary to make another request to list each subdirectory. This may end up involving hundreds of requests. Done sequentially, this can be extremely slow, especially if each one involves network access.

This commit makes the `list()` method asynchronous, which enables Galaxy to wait for the underlying system to complete the requests concurrently, resulting in a massive speedup. The price to pay is the extra complexity of using the async primitives.

Since this change implies that all functions in the chain up to the API endpoints and the test functions must also be made asynchronous, this commit also takes care of it.
kysrpex added a commit to kysrpex/galaxyproject-galaxy that referenced this pull request Dec 16, 2024
…nc route thread

FastAPI runs sync routes on a separate thread (from a threadpool), and async routes on the main thread's event loop. Originally, it was conceived to convert `_list` to an async method (see galaxyproject#19256). However, given that all other file source plugins are blocking, they could block the main thread for a significant amount of time (see fastapi/fastapi#3091). Thus, the implementation below creates a new event loop within the sync route's thread so that the eLabFTW plugin can still send concurrent requests without blocking the main thread.
@kysrpex
Copy link
Contributor Author

kysrpex commented Dec 16, 2024

This PR is no longer needed, see this comment.

@kysrpex kysrpex closed this Dec 16, 2024
@kysrpex kysrpex deleted the file_sources_asynchronous_listing branch December 17, 2024 14:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant