Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkgset.FileCache: implement __contains__ for efficiency #267

Merged

Conversation

AdamWill
Copy link
Contributor

@AdamWill AdamWill commented Nov 19, 2024

See https://pagure.io/pungi/pull-request/1796 . We figured out there that foo in somefilecache works, but is much less performant than foo in somefilecache.file_cache, because the latter hits the efficient __contains__ method that Python dicts have, but the former winds up iterating over the dict's keys via FileCache.__iter__. In an extreme case, this was making something take 20 minutes instead of 3 seconds.

Users can avoid this by just hitting the file_cache directly as in the PR, but to fix this for any other users without them having to change, let's implement __contains__ in FileCache by just passing it through to the dict.

@@ -229,6 +229,9 @@ def __getitem__(self, name):
def __setitem__(self, name, value):
self.file_cache[os.path.abspath(name)] = value

def __contains__(self, item):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must say, the mismatch between getitem/setitem and contains is concerning.

Keys are converted into absolute paths in the getter/setter but they are NOT converted during a contains check.
This means the basic contract of "inserting an object into a container means the container contains it" is violated:

from kobo.pkgset import FileCache

fc = FileCache()

fc['/y'] = object()
# Is the thing we just inserted in there?
# Of course it is...
assert '/y' in fc

# How about now?
fc['x'] = object()
# Nope - CRASH! I asked for 'x' but it stored '$PWD/x'
# and did not do the same transformation during 'contains'.
assert 'x' in fc  

I initially thought that your change introduced this problem, however the problem is already present on the current foo in self.file_cache path making use of __iter__, the code just wasn't visible.

I guess we should not attempt to fix that in this change, since it might be backwards-incompatible.
But if you are looking at code using this class, maybe you want to take a look at it with this bug in mind.
Any code using this class and not providing absolute paths as input has the potential to behave very strangely.

See https://pagure.io/pungi/pull-request/1796 . We figured out
there that `foo in somefilecache` works, but is much less
performant than `foo in somefilecache.file_cache`, because the
latter hits the efficient `__contains__` method that Python
dicts have, but the former winds up iterating over the dict's
keys via `FileCache.__iter__`. In an extreme case, this was making
something take 20 minutes instead of 3 seconds.

Users can avoid this by just hitting the `file_cache` directly
as in the PR, but to fix this for any other users without them
having to change, let's implement `__contains__` in `FileCache`
by just passing it through to the dict.

Signed-off-by: Adam Williamson <[email protected]>
@rohanpm rohanpm merged commit be7b4bd into release-engineering:master Nov 20, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants