Auto File Management Part 2: Allow Grackle to search for automatically managed data files #237
Open: mabruzzo wants to merge 24 commits into grackle-project:main from mabruzzo:C-auto-data
mabruzzo force-pushed the C-auto-data branch from 1b19a53 to b4aef08 (September 4, 2024 16:58)
I also added documentation and integrated the tool into the testing framework.
mabruzzo force-pushed the C-auto-data branch from b4aef08 to 1b077ff (September 18, 2024 00:10)
mabruzzo force-pushed the C-auto-data branch from 1b077ff to 53c4931 (September 20, 2024 12:50)
mabruzzo force-pushed the C-auto-data branch from 53c4931 to 99a770e (September 22, 2024 22:04)
We plan to eventually install the grdata tool as a standalone command-line program. Essentially, the build system will perform some substitutions (the CMake build system uses CMake's built-in ``configure_file`` command, while the classic build system uses the analogous ``configure_file.py`` script). This commit introduces a few minor tweaks to grdata.py so that it can more easily be consumed by the ``configure_file.py`` script.
- The ``configure_file.py`` script itself will ultimately require a few more tweaks so that it doesn't report occurrences of Python's decorator syntax as errors.
- However, this commit minimizes the number of required changes.
Among other things, we started using picohash and the functions in os_utils.ch.
The file registry is encoded in the autogenerated file_registry.h file that is produced from file_registry.h.in. To get this to work properly for the Makefile build system, I needed to add a new feature to ``configure_file.py``. In detail:
* ``configure_file.py`` already provided the option to replace a variable in a template file with multiple lines of content read from an external file. We assumed that this option would only be used for formatting multiline strings in printf statements. Consequently, the machinery would replace any newline characters encountered in the external file with the "\n" escape sequence used in C strings to represent a newline.
* I simply added the option to ``configure_file.py`` to do the same thing WITHOUT escaping newline characters.
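To make the difference between the two substitution modes concrete, here is a minimal Python sketch. It is not the actual ``configure_file.py`` implementation: the ``substitute`` function, its ``escape_newlines`` flag, and the ``@NAME@`` placeholder handling shown here are assumptions made purely for illustration.

```python
import re


def substitute(template, variables, escape_newlines=True):
    """Replace each ``@NAME@`` placeholder in ``template`` with ``variables[NAME]``.

    Hypothetical helper (not the real configure_file.py API).

    With escape_newlines=True, real newline characters in the replacement
    become the two-character sequence backslash + 'n', which is what you want
    inside a C string literal. With escape_newlines=False, the replacement is
    inserted verbatim, which is what you want when the content is itself
    several lines of C code (e.g. the entries of a file registry).
    """

    def _replace(match):
        value = variables[match.group(1)]
        if escape_newlines:
            value = value.replace("\n", "\\n")
        return value

    return re.sub(r"@(\w+)@", _replace, template)


# Mode 1: multi-line text destined for a printf format string.
print(substitute('printf("@MSG@");', {"MSG": "line 1\nline 2"}))

# Mode 2: multi-line content inserted as-is (illustrative registry entries).
entries = '{"a.h5", "abc"},\n{"b.h5", "def"},'
print(substitute("static entry registry[] = {\n@ENTRIES@\n};",
                 {"ENTRIES": entries}, escape_newlines=False))
```

The second mode is the one needed to generate a header from a template when the substituted content spans multiple lines of source code rather than a single string literal.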
This is a followup to PR #235 (it includes all the changes from #235 and should be reviewed afterwards)
Overview
PR #235 introduced a command-line tool, integrated within pygrackle, that manages Grackle's data files in a standardized location (in a way that is compatible with having multiple Grackle versions installed).
This PR makes it possible for the Grackle library to automatically look up a data file from this standard location, without specifying the full path. At the moment, this functionality is most useful when used with pygrackle. (A followup PR will make it easier to use this new functionality when pygrackle isn't installed.) This new functionality is enabled by the new `grackle_data_file_options` parameter (by default, this behavior is disabled).

How this automatic lookup works
In this section, we discuss how this feature works when it is enabled (more on how to do that in the next section).
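As a rough orientation for the rest of this section, the search procedure (exact name match, standard location, checksum validation) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the C implementation from this PR: the function name `resolve_managed_data_file`, the dictionary layout of `registry`, and the `data_dir` argument are all hypothetical, and the logic for determining the standard location is deliberately left out.

```python
import hashlib
import os


def resolve_managed_data_file(name, registry, data_dir, verify_checksum=True):
    """Return the full path of a managed data file, or raise ValueError.

    Hypothetical helper. Assumptions:
      * registry maps each known file name to its expected SHA-1 hex digest
      * data_dir is the standard location, already determined by the caller
    """
    # Step 1: the name must EXACTLY match a known file. Anything that looks
    # like a path ("path/to/file.h5") is rejected, because it can never
    # equal a bare file name in the registry.
    if name not in registry:
        raise ValueError(f"{name!r} is not a data file known to this version")

    # Step 2: build the path inside the standard location.
    path = os.path.join(data_dir, name)

    # Step 3: optionally confirm the file has the expected contents.
    if verify_checksum:
        sha1 = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                sha1.update(chunk)
        if sha1.hexdigest() != registry[name]:
            raise ValueError(f"checksum mismatch for {path}")
    return path
```

Rejecting anything that is not a bare registry key keeps the lookup from ever reading a file outside the managed directory, which matches the "for safety reasons" rule described in this section.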
When this feature is enabled, the value of `grackle_data_file` is not treated as a path. Instead, it should exactly specify the name of one of the data files shipped with Grackle (e.g. "CloudyData_UVB=FG2011.h5", "CloudyData_UVB=HM2012.h5", "cloudy_metals_2008_3D.h5").

When you invoke `initialize_chemistry_data` (and this functionality is enabled), Grackle invokes the following search procedure:

1. First, it checks whether the name of the data file exactly matches one of the standard data files shipped with the current version of Grackle.
   - The known files are listed in the `file_registry.txt` file that was introduced in Auto File Management Part 1: Introducing a Datafile Management Tool #235.
   - If `grackle_data_file` does not EXACTLY match a known file, then an error is reported. For safety reasons, if the user specifies a path to a data file, we reject it (e.g. "CloudyData_UVB=FG2011.h5" is ok but "path/to/CloudyData_UVB=FG2011.h5" is NOT).
2. Next, we determine the standard location where the data files should be stored. The C function that determines this location encodes the same logic as the corresponding Python function that is used to manage the data files.
3. Finally, we construct the path to the file and ensure that the file has the expected contents (by validating its checksum against the `file_registry.txt` file). [1]

How to enable automatic lookup
To enable/disable this feature, you need to assign `grackle_data_file_options` a constant value encoded by one of the following macros:

- `GR_DFOPT_FULLPATH_NO_CKSUM`: In this case, we assume that `grackle_data_file` encodes the full path to a file. When no value is provided, we default to this case. This is the classic behavior.
- `GR_DFOPT_MANAGED`: This unlocks the new functionality described in this PR. In the unlikely event that different Grackle versions ship different versions of a data file, we will always load the standard data file contemporaneous with the current version of Grackle.
- `GR_DFOPT_MANAGED_NO_CKSUM`: This does the same thing as the previous case, but doesn't do any checksum calculation and validation. This is provided in case the user is working on a "fragile" parallel filesystem (like the one on Frontera) and wants to minimize the file system operations for some of their MPI processes. [2]

In pygrackle, these values are accessed through the new `constants` object.

Footnotes
[1] I'm somewhat tempted to use this alternative library. To do that, we would need to change the checksums from SHA-1 to SHA-256. But I think that would be fine. We also discuss making this change, for separate reasons, in Auto File Management Part 1: Introducing a Datafile Management Tool #235.

[2] If the user decides to use this on all MPI ranks in place of `GR_DFOPT_STANDARD_CONTEMPORANEOUS`, then they are accepting any risk associated with (hypothetical) bugs that could lead to reading the wrong file. (This is unlikely, but in this scenario, the blame lies entirely with the user.)
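As a rough illustration of the usage pattern the no-checksum option seems intended for, the sketch below has one MPI rank validate the checksum while every other rank skips it. Everything here is an assumption for illustration only: the `DataFileOption` enum and its numeric values are placeholders (they are not pygrackle's `constants` object or the real macro values), and `option_for_rank` is a hypothetical helper that takes the rank as a plain integer.

```python
from enum import Enum


class DataFileOption(Enum):
    # Names mirror the macros described in this PR; the numeric values are
    # placeholders chosen for this sketch, NOT the real macro values.
    FULLPATH_NO_CKSUM = 0
    MANAGED = 1
    MANAGED_NO_CKSUM = 2


def option_for_rank(rank, validating_rank=0):
    """Pick an option so that only one rank pays for checksum validation.

    Hypothetical helper: the validating rank uses the checksummed lookup,
    and all other ranks use the managed lookup without the checksum.
    """
    if rank == validating_rank:
        return DataFileOption.MANAGED
    return DataFileOption.MANAGED_NO_CKSUM


for rank in range(4):
    print(rank, option_for_rank(rank).name)
```

With this arrangement the file contents are still verified once per run, while the remaining ranks avoid the extra file system reads.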