PathLike wrapper/cache for ExternalStorage #186

Open
wants to merge 64 commits into base: main
Changes from 63 commits
Commits
ffdb901
PathLike wrapper/cache for ExternalStorage
dwhswenson Apr 26, 2023
4ad7fb1
Merge branch 'main' into shared-object-v2
dwhswenson Apr 26, 2023
eb19e0f
mypy
dwhswenson Apr 26, 2023
8e6a78c
Merge branch 'shared-object-v2' of github.com:OpenFreeEnergy/gufe int…
dwhswenson Apr 26, 2023
3be13c0
docstrings
dwhswenson Apr 27, 2023
dda02e2
Merge branch 'main' of github.com:OpenFreeEnergy/gufe into shared-obj…
dwhswenson May 17, 2023
472151e
Add StorageManager code
dwhswenson May 30, 2023
b692c2f
rename to Stagine
dwhswenson May 30, 2023
7432ff4
Add tests for _delete_empty_dirs
dwhswenson May 30, 2023
319d0d0
Storage docs
dwhswenson May 31, 2023
5650609
outline of storage manager tests
dwhswenson May 31, 2023
ddbbd19
minor improvements on staging directory
dwhswenson May 31, 2023
c5ce48a
first storage lifecycle test works
dwhswenson May 31, 2023
236b263
Merge branch 'main' of github.com:OpenFreeEnergy/gufe into shared-obj…
dwhswenson May 31, 2023
b805aac
cleanup mypy
dwhswenson May 31, 2023
ed5e83c
change to unit taking in the label
dwhswenson May 31, 2023
6181039
lots of updates; switched to harness for tests
dwhswenson Jun 5, 2023
1880f73
Big reorg for shared overlapping staging
dwhswenson Jun 6, 2023
1e4ca2c
remove _storage_path_conflict
dwhswenson Jun 6, 2023
a6d26f3
docs & types
dwhswenson Jun 6, 2023
aabbc33
docs, types, logging
dwhswenson Jun 6, 2023
b4d73b3
finish TestHoldingOverlapsPermanentStorageManager
dwhswenson Jun 6, 2023
7af006e
mypy
dwhswenson Jun 6, 2023
58a58bc
test_repr
dwhswenson Jun 6, 2023
8e429f5
renaming around DAGContextManager
dwhswenson Jun 6, 2023
b70df48
holding => staging
dwhswenson Jun 7, 2023
08e3ac2
finish docs (I think?)
dwhswenson Jun 7, 2023
ca7871b
remove completed TODO
dwhswenson Jun 7, 2023
2aa0616
start to testing edge case logging
dwhswenson Jun 9, 2023
7c03dcd
Update stagingdirectory.py
richardjgowers Jun 12, 2023
383075e
tests for single_file_transfer logging
dwhswenson Jun 12, 2023
6365398
tests for read-only transfers
dwhswenson Jun 16, 2023
d35bd60
fix repr and cleanup tests
dwhswenson Jun 16, 2023
7cc10f9
test for permanent transfer to external
dwhswenson Jun 16, 2023
ab025f1
test for Permanent delete staging
dwhswenson Jun 17, 2023
cd70ab2
Add test for missing file on cleanup
dwhswenson Jun 17, 2023
ac1b1d0
Merge branch 'shared-object-v2' of github.com:OpenFreeEnergy/gufe int…
dwhswenson Jun 17, 2023
ea054be
Merge branch 'main' of github.com:OpenFreeEnergy/gufe into shared-obj…
dwhswenson Jun 22, 2023
90f2597
get_other_shared to private
dwhswenson Jun 22, 2023
a2e05b2
Merge branch 'main' into shared-object-v2
dwhswenson Jul 6, 2023
80eccc4
Merge branch 'main' into shared-object-v2
dwhswenson Aug 28, 2023
e9ed7a8
Merge branch 'main' of github.com:OpenFreeEnergy/gufe into shared-obj…
dwhswenson Sep 8, 2023
b4e1d42
Merge branch 'main' into shared-object-v2
dotsdl Sep 8, 2023
2b070ca
Merge branch 'shared-object-v2' of github.com:OpenFreeEnergy/gufe int…
dwhswenson Sep 9, 2023
4fd4a66
Merge branch 'main' into shared-object-v2
dotsdl Sep 12, 2023
5e56461
Merge branch 'main' into shared-object-v2
dotsdl Sep 19, 2023
73d3a1e
Merge branch 'main' into shared-object-v2
dwhswenson Nov 3, 2023
524cc6e
Merge branch 'main' of github.com:OpenFreeEnergy/gufe into shared-obj…
dwhswenson Dec 1, 2023
dd1b6dc
updates from other branch
dwhswenson Dec 1, 2023
a575dd3
make mypy happy
dwhswenson Dec 1, 2023
ba6fcff
Merge branch 'main' into shared-object-v2
dwhswenson Dec 1, 2023
5d0df5f
pep8
dwhswenson Dec 1, 2023
1418aee
Merge branch 'shared-object-v2' of github.com:OpenFreeEnergy/gufe int…
dwhswenson Dec 1, 2023
e057332
pep8
dwhswenson Dec 1, 2023
1cfe910
StagingDirectory -> StagingRegistry
dwhswenson Dec 4, 2023
78e003b
remove prefix; remove get_other_shared
dwhswenson Dec 7, 2023
265e786
delete_empty_dirs => keep_empty_dirs
dwhswenson Dec 7, 2023
006d787
Merge branch 'main' into shared-object-v2
dwhswenson Dec 11, 2023
ce12326
Add logging to not clean up registered directory
dwhswenson Dec 11, 2023
9afeb2f
Merge branch 'shared-object-v2' of github.com:OpenFreeEnergy/gufe int…
dwhswenson Dec 11, 2023
aaa2aab
StagingPath.fspath => StagingPath.as_path()
dwhswenson Dec 13, 2023
cf60d1b
pep8
dwhswenson Dec 13, 2023
da9955e
remove fspath from StagingRegistry
dwhswenson Dec 22, 2023
8489c24
Merge branch 'main' into shared-object-v2
dwhswenson Jan 22, 2024
103 changes: 102 additions & 1 deletion docs/guide/storage.rst
@@ -1,5 +1,106 @@
The GUFE Storage System
=======================

Storage lifetimes
-----------------

Storage docs.
The storage system in GUFE is closely tied to the GUFE protocol system. The
key concept is that the different levels of the GUFE protocol system
(campaign, DAG, and unit) inherently imply different lifetimes for the data
that is created. Those different lifetimes define the stages of the GUFE
storage system.

In an abstract sense, as used by protocol developers, these three levels
correspond to three lifetimes of data:

Comment on lines +13 to +15
Contributor

Another last chance to rename things, what if we instead just named the storage lifetimes against the thing they're scoped to. So scratch -> unit_storage, shared -> dag_storage and permanent -> campaign_storage. The idea being it's easier to remember their scope.

* ``scratch``: This is temporary data that is only needed for the lifetime
of a :class:`.ProtocolUnit`. This data is not guaranteed to be available
beyond the single :class:`.ProtocolUnit` where it is created, but may be
reused within that :class:`.ProtocolUnit`.
* ``shared``: This is data that is shared between different units in a
:class:`.ProtocolDAG`. For example, a single equilibration stage might be
shared between multiple production runs. The output snapshot of the
equilibration would be a suitable thing to put in ``shared``
data. This data is guaranteed to be present from when it is created until
the end of the :class:`.ProtocolDAG`, but is not guaranteed to exist after
the :class:`.ProtocolDAG` terminates.
* ``permanent``: This is the data that will be needed beyond the scope of a
single rough estimate of the calculation. This could include anything that
Contributor

not sure I like "single rough estimate" here, a single estimate might be perfectly fine. Maybe instead, "this is the data that will be available for post-simulation analysis beyond the scope of a single DAG"

an extension of the simulation would require, or things that require
network-scale analysis. Anything stored here will be usable after the
calculation has completed.
Comment on lines +30 to +31
Contributor

"after the calculation" could equally apply to a Unit. Maybe instead "Anything stored here will be retrievable after the Protocol estimation has completed"


The ``scratch`` area is always a local directory. However, ``shared`` and
``permanent`` can be external (remote) resources, using the
:class:`.ExternalResource` API.
Contributor

maybe mention that whilst these might not be a local directory, they will still act like one?


As a practical matter, the GUFE storage system can be handled with a
:class:`.StorageManager`. This automates some aspects of the transfer
between stages of the GUFE storage system, and simplifies the API for
protocol authors. In detail, this provides protocol authors with
Contributor

can we use "developers" instead of "protocol authors"? I'm thinking developer is more common, someone might get confused thinking "protocol author" is the agent which is submitting protocols or something daft

``PathLike`` objects for ``scratch``, ``shared``, and ``permanent``. All
three of these objects actually point to special subdirectories of the
local scratch space for a specific unit, but are managed by context
managers at the executor level, which handle the process of moving objects
from local staging directories to the actual ``shared`` and ``permanent``
locations, which can be external resources.


External resource utilities
---------------------------

For flexible data storage, GUFE defines the :class:`.ExternalResource` API,
which allows data to be stored and loaded in a way that is agnostic to the
underlying data store, as long as the store can be represented as a
key-value store. Within GUFE, we provide several examples, including
:class:`.FileStorage` and :class:`.MemoryStorage` (primarily useful for
testing). The specific ``shared`` and ``permanent`` resources, as provided
to the executor, can be instances of an :class:`.ExternalResource`.

.. note::

The ``shared`` space must be a resource where an uploaded object is
instantaneously available, otherwise later protocol units may fail if the
shared result is unavailable. This means that files or approaches based
on ``scp`` or ``sftp`` are fine, but things like cloud storage, where the
existence of a new document may take time to propagate through the
network, are not recommended for ``shared``.
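
To illustrate what "can be represented as a key-value store" means here, the following is a minimal sketch of an in-memory store that maps string labels to bytes. The class name ``InMemoryStore`` and its method names are hypothetical and chosen only for this illustration; they are not the :class:`.ExternalResource` API.

```python
from typing import Dict


class InMemoryStore:
    """Hypothetical key-value store (illustration only, not the gufe API)."""

    def __init__(self) -> None:
        # Maps a string label (the key) to the raw bytes of a file.
        self._data: Dict[str, bytes] = {}

    def store_bytes(self, key: str, data: bytes) -> None:
        self._data[key] = data

    def load_bytes(self, key: str) -> bytes:
        return self._data[key]

    def exists(self, key: str) -> bool:
        return key in self._data

    def delete(self, key: str) -> None:
        del self._data[key]


store = InMemoryStore()
store.store_bytes("dag1/unit1/equil.pdb", b"ATOM ...")
# The object is readable immediately after it is stored, which is the
# property the note above asks of a ``shared`` resource.
snapshot = store.load_bytes("dag1/unit1/equil.pdb")
```

Any backend that can honor these four operations (a directory tree, a dictionary, an object store) could in principle sit behind the same interface.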


Details: Manangement of the Storage Lifetime
Contributor

typo: management

--------------------------------------------

The concepts of the storage lifetimes are important for protocol authors,
but details of implementation are left to the specific executor. In order to
facilitate correct treatment of the storage lifecycle, GUFE provides a few
helpers. The information in this section is mostly of interest to authors of
executors. The helpers are:

* :class:`.StorageManager`: This is the overall façade interface for
interacting with the rest of the storage lifecycle tools. It provides two
methods to generate context managers: one for the :class:`.ProtocolDAG`
level of the lifecycle, and one for the :class:`.ProtocolUnit` level of the
lifecycle. This class is designed for the use case that the entire DAG is
run in serial within a single process. Subclasses of this can be created
for other execution architectures, where the main logic changes would be
in the methods that return those context managers.
* :class:`.StagingRegistry`: This handles the logic around staging paths
within a :class:`.ProtocolUnit`. Think of this as an abstract
representation of a local directory. Paths within it register with it, and
it handles deletion of the temporary local files when not needed, as well
as the download of remote files when necessary for reading. There are two
important subclasses of this: :class:`.SharedStaging` for a ``shared``
resource, and :class:`.PermanentStaging` for a ``permanent`` resource.
* :class:`.StagingPath`: This represents a file within the
:class:`.StagingRegistry`. It contains both the key (label) used in the
key-value store, as well as the actual local path to the file. When its
``__fspath__`` method is called, it registers itself with its
:class:`.StagingRegistry`, which handles managing it over its lifecycle.
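
The register-on-``__fspath__`` idea described for :class:`.StagingPath` can be sketched with the standard ``os.PathLike`` protocol. The classes ``SketchRegistry`` and ``SketchPath`` below are hypothetical stand-ins written for this illustration, not the classes in this PR, and the sketch omits the download of remote files.

```python
import os
import tempfile
from pathlib import Path


class SketchRegistry:
    """Hypothetical stand-in for a staging registry (local files only)."""

    def __init__(self, root):
        self.root = Path(root)
        self.registered = set()

    def __truediv__(self, label):
        # Building a path does not register it; only using it does.
        return SketchPath(self, label)

    def register(self, staging_path):
        self.registered.add(staging_path)

    def cleanup(self):
        # Delete the local copy of every path that was actually used.
        for staging_path in self.registered:
            if staging_path.local_path.exists():
                staging_path.local_path.unlink()
        self.registered.clear()


class SketchPath(os.PathLike):
    """Hypothetical stand-in holding a label and its local path."""

    def __init__(self, registry, label):
        self.registry = registry
        self.label = label

    @property
    def local_path(self):
        return self.registry.root / self.label

    def __fspath__(self):
        # Called by open(), os.fspath(), etc.; register on first use.
        self.registry.register(self)
        self.local_path.parent.mkdir(parents=True, exist_ok=True)
        return str(self.local_path)


with tempfile.TemporaryDirectory() as tmp:
    registry = SketchRegistry(tmp)
    out = registry / "results/energies.txt"
    count_before_use = len(registry.registered)
    with open(out, "w") as f:  # open() triggers __fspath__
        f.write("1.0\n")
    count_after_use = len(registry.registered)
    existed = out.local_path.exists()
    registry.cleanup()
    removed = not out.local_path.exists()
```

In this sketch a path that is created but never opened is never registered, so cleanup only touches files the unit actually used.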

In practice, the executor uses the :class:`.StorageManager` to create a
:class:`.DAGContextManager` at the level of a DAG, and then uses the
:class:`.DAGContextManager` to create a context to run a unit. That context
creates a :class:`.SharedStaging` and a :class:`.PermanentStaging`
associated with the specific unit. Those staging directories, with the
scratch directory, are provided to the :class:`.ProtocolUnit`, so that
these are the only objects protocol authors need to interact with.
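
The nesting of a DAG-level context and a unit-level context can be sketched with ``contextlib``. Everything below is an assumption made for illustration: the function names ``running_dag`` and ``running_unit``, the use of plain dictionaries as the ``shared`` and ``permanent`` stores, and the choice to transfer both staging directories when a unit finishes successfully. It is not the :class:`.StorageManager` implementation.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


def _upload(staging_dir, store, prefix):
    """Copy every file under staging_dir into a dict-like store."""
    for path in sorted(staging_dir.rglob("*")):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(staging_dir).as_posix()}"
            store[key] = path.read_bytes()


@contextmanager
def running_dag(scratch_root, shared_store, permanent_store, dag_label):
    """Hypothetical DAG-level context; yields a unit-level context factory."""

    @contextmanager
    def running_unit(unit_label):
        unit_dir = Path(scratch_root) / dag_label / unit_label
        scratch = unit_dir / "scratch"
        shared = unit_dir / "shared"
        permanent = unit_dir / "permanent"
        for directory in (scratch, shared, permanent):
            directory.mkdir(parents=True, exist_ok=True)
        try:
            yield scratch, shared, permanent
            # Reached only if the unit body raised no exception.
            prefix = f"{dag_label}/{unit_label}"
            _upload(shared, shared_store, prefix)
            _upload(permanent, permanent_store, prefix)
        finally:
            # Local staging and scratch do not outlive the unit.
            shutil.rmtree(unit_dir)

    try:
        yield running_unit
    finally:
        # Shared data is not guaranteed to exist after the DAG ends.
        for key in [k for k in shared_store if k.startswith(dag_label + "/")]:
            del shared_store[key]


shared_store, permanent_store = {}, {}
with tempfile.TemporaryDirectory() as tmp:
    with running_dag(tmp, shared_store, permanent_store, "dag1") as running_unit:
        with running_unit("unit1") as (scratch, shared, permanent):
            (scratch / "tmp.dat").write_text("temporary")
            (shared / "equil.txt").write_text("snapshot")
            (permanent / "result.txt").write_text("estimate")
        shared_during_dag = sorted(shared_store)
```

The unit body only ever sees three local directories; the transfers and the cleanup happen when the contexts exit.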