Auto File Management Part 2: Allow Grackle to search for automatically managed data files #237

Open
wants to merge 24 commits into
base: main

Conversation

mabruzzo
Collaborator

@mabruzzo mabruzzo commented Sep 4, 2024

This is a follow-up to PR #235. It includes all the changes from #235 and should be reviewed after that PR.

Overview

PR #235 introduced a command-line tool, integrated within pygrackle, that manages Grackle's data files in a standardized location (in a way that is compatible with having multiple Grackle versions installed).

This PR makes it possible for the Grackle library to automatically look up a data file from this standard location, without specifying the full path. At the moment, this functionality is most useful when used with pygrackle. (A follow-up PR will make it easier to use this new functionality when pygrackle isn't installed.) This new functionality is enabled by the new grackle_data_file_options parameter (by default, this behavior is disabled).

How this automatic lookup works

In this section, we discuss how this feature works when it is enabled (more on how to do that in the next section).

When this feature is enabled, the value of grackle_data_file is not treated as a path. Instead it should exactly specify the name of one of the data files shipped with Grackle (e.g. "CloudyData_UVB=FG2011.h5", "CloudyData_UVB=HM2012.h5", "cloudy_metals_2008_3D.h5").

When you invoke initialize_chemistry_data (and this functionality is enabled) grackle invokes the following search procedure:

  1. First, it checks whether the name of the datafile exactly matches one of the standard data files shipped with the current version of Grackle.

    • The list of filenames is automatically encoded in the C library at compile-time, based on the list specified in the file_registry.txt file that was introduced in Auto File Management Part 1: Introducing a Datafile Management Tool #235.
    • If the string specified by grackle_data_file does not EXACTLY match a known file, then an error is reported. For safety reasons, if the user specifies a path to a data file, we reject it (e.g. "CloudyData_UVB=FG2011.h5" is ok but "path/to/CloudyData_UVB=FG2011.h5" is NOT).
  2. Next, we determine the standard location where the datafiles should be stored. The C function that determines this location encodes the same logic as the corresponding python function that is used to manage the datafiles.

  3. Finally, we construct the path to the file and ensure that the file has the expected contents.

    • my big fear while implementing this is that I would make a mistake in some logic (either the python logic that manages the datafiles or the logic for finding the datafiles) and we would have grackle silently use the wrong datafile (invalidating users' results).
    • as insurance, we validate the file's known checksum. Earlier we mentioned that we encoded the known filenames directly into the C library. At the same time, we also encode the known checksum (which is also listed in the aforementioned file_registry.txt file).
    • to actually compute the checksum we use the functions provided by the open-source picohash C library. Since this "library" is just a single header-file, we actually ship it as a part of Grackle (see footnote 1). Whether the CMake build system or classic build system is used, the functionality is included in grackle without any extra steps.
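
The three steps above can be sketched in Python. This is only an illustration, not the C implementation from this PR: the function name `find_managed_file`, the registry entry, and the `data_dir` argument are all hypothetical, and the logic that determines the standard location is deliberately left out (the directory is passed in). The only detail taken from the PR is that the checksums are SHA-1.

```python
import hashlib
import os

# Hypothetical registry. In Grackle the real names and checksums come from
# file_registry.txt; the single entry below is made up for illustration.
FILE_REGISTRY = {
    "example_data.h5": hashlib.sha1(b"example contents").hexdigest(),
}


def find_managed_file(name, data_dir, registry=FILE_REGISTRY):
    """Return the verified path of a managed data file, or raise ValueError."""
    # Step 1: the name must exactly match a registry entry. A path such as
    # "path/to/example_data.h5" fails this test because it is not a key.
    if name not in registry:
        raise ValueError(f"{name!r} is not a known data file")

    # Step 2: the standard location is supplied by the caller; the real
    # logic that determines it is not reproduced in this sketch.
    path = os.path.join(data_dir, name)

    # Step 3: hash the file contents and compare with the recorded checksum.
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    if digest.hexdigest() != registry[name]:
        raise ValueError(f"checksum mismatch for {path}")
    return path
```

Rejecting anything that is not an exact registry key means that a path can never reach the file-opening step, which is the safety property described in step 1.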

How to enable automatic lookup

To enable/disable this feature, you need to assign grackle_data_file_options a constant-value encoded by one of the following macros:

  • GR_DFOPT_FULLPATH_NO_CKSUM: In this case we assume that grackle_data_file encodes the full path to a file. When no value is provided, we default to this case. This is the classic behavior.
  • GR_DFOPT_MANAGED: this unlocks the new functionality described in this PR. In the unlikely event that different grackle versions ship different versions of a datafile, we will always load the standard datafile contemporaneous with the current version of grackle.
  • GR_DFOPT_MANAGED_NO_CKSUM: does the same thing as the previous case, but doesn't do any checksum calculation and validation. This is provided in case the user is working on a "fragile" parallel filesystem (like the one on Frontera) and wants to minimize the file system operations for some of their MPI processes (see footnote 2).
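
As a compact summary of the three options, the following Python sketch maps each one to the two decisions it controls: whether grackle_data_file is interpreted as a registered name, and whether the checksum is verified. The `DataFileOption` enum and `lookup_plan` function are hypothetical stand-ins; they are not the pygrackle constants object, and the numeric values are arbitrary rather than the values of the C macros.

```python
import enum


class DataFileOption(enum.Enum):
    # Stand-ins for the GR_DFOPT_* macros; the numeric values are arbitrary.
    FULLPATH_NO_CKSUM = 1
    MANAGED = 2
    MANAGED_NO_CKSUM = 3


def lookup_plan(option=DataFileOption.FULLPATH_NO_CKSUM):
    """Return (treat_as_registered_name, verify_checksum) for an option."""
    if option is DataFileOption.FULLPATH_NO_CKSUM:
        # Classic behavior: the value is a full path and nothing is verified.
        return (False, False)
    if option is DataFileOption.MANAGED:
        # Managed lookup with checksum validation.
        return (True, True)
    if option is DataFileOption.MANAGED_NO_CKSUM:
        # Managed lookup that skips the checksum to limit file system work.
        return (True, False)
    raise ValueError(f"unknown option: {option!r}")
```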

In pygrackle, these values are accessed through the new constants object. For example,

Footnotes

  1. I'm somewhat tempted to use this alternative library. To do that we would need to change the checksums from SHA-1 to SHA-256. But I think that would be fine. We also discuss making this change, for separate reasons, in Auto File Management Part 1: Introducing a Datafile Management Tool #235.

  2. If the user decides to use this on all MPI ranks in place of GR_DFOPT_MANAGED, then they are accepting any risk associated with (hypothetical) bugs that could lead to reading the wrong file. (This is unlikely, but in this scenario, the blame lies entirely with the user.)

We plan to eventually install the grdata tool as a standalone command
line program. Essentially the build-system will perform some
substitutions (the CMake build system uses CMake's built-in
``configure_file`` command while the classic build system uses the
analogous ``configure_file.py`` script).

This commit introduces a few minor tweaks to grdata.py so that it can
more easily be consumed by the ``configure_file.py`` script.
- The ``configure_file.py`` script, itself, will ultimately require a
  few more tweaks so that it doesn't report occurrences of python's
  decorator-syntax as errors
- However, this commit minimizes the number of required changes
Among other things, we started using picohash and using the functions in
os_utils.ch
The file registry is encoded in the autogenerated file_registry.h file
that is produced from file_registry.h.in.

To get this to work properly for the Makefile build-system, I needed to
add a new feature to ``configure_file.py``. In detail:

* ``configure_file.py`` already provided the option to replace a
  variable in a template file with multiple lines of content read from
  an external file. We assumed that this option would only be used for
  formatting multiline strings in printf statements. Consequently, the
  machinery would replace any new-line characters encountered in the
  external file with the "\n" escape-sequence used in C strings to
  represent a new-line.

* I simply added the option to ``configure_file.py`` to do the
  same thing WITHOUT escaping new-line characters.
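
The difference between the two modes can be illustrated with a short Python sketch. This is not the code of ``configure_file.py``: the ``substitute`` function and its ``escape_newlines`` flag are hypothetical, and the ``@NAME@`` placeholder form is borrowed from CMake's ``configure_file`` command on the assumption that the script uses a similar convention.

```python
import re


def substitute(template, variables, escape_newlines=True):
    """Replace each @NAME@ placeholder in template with variables[NAME]."""

    def replace(match):
        value = variables[match.group(1)]
        if escape_newlines:
            # Turn each real newline into the two characters backslash + n,
            # so the value can sit inside a C string literal.
            value = value.replace("\n", "\\n")
        return value

    return re.sub(r"@([A-Za-z_][A-Za-z0-9_]*)@", replace, template)


# Escaped: suitable for a multiline message inside a printf format string.
print(substitute('printf("@MSG@");', {"MSG": "line one\nline two"}))

# Unescaped: suitable for pasting several lines of C source into a header.
print(substitute("{\n@ENTRIES@\n}", {"ENTRIES": '"a.h5",\n"b.h5",'},
                 escape_newlines=False))
```

The unescaped mode is what a generated header like file_registry.h needs, since each registry entry has to land on its own line of C source rather than inside a string literal.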