Pathy.exists() check might impact performance due to partial startswith check #109

Open
yaelmi3 opened this issue Oct 22, 2023 · 1 comment
Labels
enhancement New feature or request

yaelmi3 (Contributor) commented Oct 22, 2023

Environment: Python 3.10, tested with Google Cloud Storage (GS)

Consider the following case:

Pathy("gs://bucket/blob-not-there").exists()

Here, exists() first checks whether the exact blob exists. If it does not, it goes on to check for a partial match, comparing the path against every blob in the bucket using startswith. This introduces two possible issues:

  1. In a bucket with a large number of blobs (ours has hundreds of thousands), this check can take unreasonably long.
  2. If there is a prefix match, exists() returns True, even though the matching blob may not be the one we are referring to.
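To make the second issue concrete, here is a minimal sketch of the behavior as I understand it. It is a toy in-memory model, not Pathy's actual implementation: `exists_with_prefix_fallback` and the `bucket` list are hypothetical names used only for illustration.

```python
from typing import Iterable


def exists_with_prefix_fallback(blob_names: Iterable[str], key: str) -> bool:
    """Toy model: exact lookup first, then a startswith scan over every name."""
    names = list(blob_names)
    if key in names:
        return True
    # Fallback described in this issue: scan all blob names for a prefix match.
    # This is linear in the number of blobs and matches unrelated blobs.
    return any(name.startswith(key) for name in names)


# Hypothetical bucket contents: no blob is named exactly "blob-not-there".
bucket = ["blob-not-there-either.txt", "other/file.txt"]

false_positive = exists_with_prefix_fallback(bucket, "blob-not-there")
print(false_positive)  # True, because another blob's name starts with the key
```

The scan also shows the first issue: a miss on the exact lookup forces a pass over every name in the bucket.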

Possible solutions

  1. Avoid checking for a blob prefix altogether
  2. Add a flag to exists(), something like exact_match
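A rough sketch of what the second option could look like, again against a toy in-memory model rather than Pathy itself. The `exists` function below and its `exact_match` parameter are hypothetical; the default of `False` is an assumption made so that existing callers would keep the current behavior.

```python
from typing import Iterable


def exists(blob_names: Iterable[str], key: str, exact_match: bool = False) -> bool:
    """Hypothetical exists() with an opt-in exact_match flag (not Pathy's API)."""
    names = set(blob_names)
    if key in names:
        return True
    if exact_match:
        # Caller asked for an exact blob only: skip the prefix scan entirely.
        return False
    # Current behavior as described in this issue: fall back to a prefix scan.
    return any(name.startswith(key) for name in names)


bucket = ["blob-not-there-either.txt", "other/file.txt"]

print(exists(bucket, "blob-not-there"))                    # True (prefix match)
print(exists(bucket, "blob-not-there", exact_match=True))  # False
```

The first option would amount to always taking the `exact_match=True` branch.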
justindujardin added the enhancement label Jan 12, 2024
justindujardin (Owner) commented:

@yaelmi3 thanks for providing this review/analysis! 🙇

Could you construct a performance test that measures how slow it is and compare it with your suggested change? I can run it on all the cloud providers to get a sense of the impact if you write a script that works with the local-mode implementation.
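As a possible starting point for such a test, here is a sketch that times the two strategies against an in-memory list of blob names. It does not touch Pathy or any cloud provider, so it only measures the cost of the scan itself; wiring it up to the local-mode implementation is left open. The function names, the blob naming scheme, and the count of 200,000 are all assumptions chosen for illustration.

```python
import timeit
from typing import List, Set


def exists_prefix(names: List[str], name_set: Set[str], key: str) -> bool:
    """Exact lookup, then a startswith scan over all names (behavior in question)."""
    return key in name_set or any(n.startswith(key) for n in names)


def exists_exact(name_set: Set[str], key: str) -> bool:
    """Exact lookup only (the suggested change)."""
    return key in name_set


# Hypothetical bucket with a few hundred thousand blobs.
names = [f"data/blob-{i:06d}.bin" for i in range(200_000)]
name_set = set(names)
missing = "data/not-there"  # neither an exact match nor a prefix of any blob

runs = 5
t_prefix = timeit.timeit(lambda: exists_prefix(names, name_set, missing), number=runs)
t_exact = timeit.timeit(lambda: exists_exact(name_set, missing), number=runs)

print(f"prefix scan: {t_prefix / runs:.6f} s per call")
print(f"exact only:  {t_exact / runs:.6f} s per call")
```

Against a real bucket the listing would be paginated network requests rather than an in-memory loop, so the absolute numbers here say little; the sketch is only meant to show the shape of the comparison.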
