Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ProtocolDAGResult caching to user-facing client #58

Open
dotsdl opened this issue Jan 10, 2023 · 3 comments · May be fixed by #271
Open

Add ProtocolDAGResult caching to user-facing client #58

dotsdl opened this issue Jan 10, 2023 · 3 comments · May be fixed by #271

Comments

@dotsdl
Copy link
Member

dotsdl commented Jan 10, 2023

ProtocolDAGResults are pulled by users to evaluate free energy differences for Transformations they are interested in. These can be rather large, and many can be pulled for a given Transformation in a single client method call. This can cause slow behavior for users, killing productivity.

Because ProtocolDAGResults never change once created, there is no risk of a stale cache. This make cache invalidation a non-issue for these objects.

Caching should be two-tier:

  1. an in-memory cache, of finite size; this can be a LRU cache, for example
  2. an on-disk cache, of finite size; max size can be set on client instantiation

When a ProtocolDAGResult ScopedKey is slated for retrieval:

  • (1) is hit first; if present, the ProtocolDAGResult is returned; if not,
  • (2) is hit; if present, the ProtocolDAGResult is returned; if not,
  • the request goes to the user-facing API
@ianmkenney
Copy link
Collaborator

@dotsdl it looks like we're already doing (a)LRU caching for PDRs (although we have up to 10000 in maxsize, which I think is pretty aggressive). That said, we could just add a disk check inside the relevant methods. I think that will achieve what you proposed. Does that sound good to you?

@dotsdl
Copy link
Member Author

dotsdl commented Apr 24, 2024

@ianmkenney our use of the (a)LRU cache there is rather crude, since usages of AlchemiscaleClient.get_network_results (a method commonly used by users to pull all results for a network) uses a ProcessPoolExecutor for retrieval and this may mean that each subprocess gets its own in-memory cache.

Do you think it's feasible to create a single in-memory cache using e.g. a WeakValueDictionary that child processes would be able to automatically use (at least on POSIX systems)? That should generally yield higher performance with fewer in-memory cache misses when using get_network_results, requiring less use of the disk cache for repeated calls by the subprocesses. Thoughts?

@dotsdl
Copy link
Member Author

dotsdl commented Apr 24, 2024

Can we also expose the cache settings (e.g. location, max records, etc.) as kwargs to the AlchemiscaleClient so users can alter these if needed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment