Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

State-Dumping Debug Tool #197

Draft
wants to merge 32 commits into
base: main
Choose a base branch
from

Conversation

mabruzzo
Copy link
Collaborator

@mabruzzo mabruzzo commented Apr 25, 2024

To make this PR easier to review, I split off some of the changes into #205. That should be reviewed first.

Description

This PR introduces "tooling" to help with Diagnosing and Debugging Grackle errors.

Motivation

One of Grackle's greatest weaknesses is that it's really hard to diagnose and debug errors:

  • This is a problem that occurs in any heating/cooling library because the errors tend to crop up in the middle of a simulation run, which makes them hard to diagnose/debug without rerunning the whole simulation
  • In principle, this should actually be a selling point for Grackle both because (i) there is a broader pool of people with a vested interest in fixing these issues and (ii) because it is relatively easy to run grackle in isolation (compared to some custom heating/cooling library that are deeply integrated within a downstream simulation code)
  • However, in practice getting help is really tricky! There are 2 ways to do it: (i) the person who encountered the error needs to be really motivated and figure out a simple way to reliably reproduce the problem OR (ii) someone with detailed knowledge of grackle needs to iterate with the person who encountered the issue to tease out the source of the problem

This PR introduces "tooling" to help address this challenge. We primarily introduce a C function to dump the internal state of the hdf5 file. Then we introduced convenience python functions (that we need anyway to test this functionality) to reproduce grackle's state from the hdf5 file and make it easy to try to use that to reproduce the issue.

This makes it easy for users to communicate enough information to get help and for the people who are actually trying to debug the problem.

Detailed Description

In this section I will provide some more details about the changes. See the website docs for a more detailed description of what they do and how to use them.

Essentially this PR seeks to introduce a new function to assist with debugging:

int grunstable_h5dump_state(const char* fname, long long dest_hid,
                            const chemistry_data* chemistry_data,
                            const code_units* initial_code_units,
                            const code_units* current_code_units,
                            const grackle_field_data* my_fields);

The idea here is simple: when you are working on reproducing a grackle-error, you can use this to dump grackle's state to an hdf5 file just before the error occurs. Specifically, the data is dumped to a newly created HDF5 (at the path specified by fname) OR within an existing hdf5 file (at a location specified by dest_hid).

To properly dump the state of grackle_field_data, we need people to initialize unused fields to NULL. This is accomplished by a new function called:

int gr_initialize_field_data(grackle_field_data *my_fields);

The idea is to have users immediately call this function right after they initialize a new grackle_field_data instance, analogous to the way that local_initialize_chemistry_parameters is used.1

I also added some python functions to help load in the dumps and use them for debugging

Stability Concerns

I'm very confident about the internal changes to Grackle. They may seem a little complicated, but I think they are robust and provide a lot of flexibility for changing things in the future.

With that said, I have some concerns about stabilizing 2 aspects of this PR. In particular I'm worried about:

  • the interface of grunstable_h5dump_state.
    • It is NOT currently very ergonomic. It requires some extra work from the end user (i.e. tracking the initial code_units state) and could definitely be improved. For example, maybe we have people create a dump right after initialization and then dump again when they think the problem occurs. Or, if we introduce on the changes suggested in Auto-computing the Required Units #198, this could be simplified a lot more.
  • the format used in the hdf5 dumps
    • I'm a little concerned that people might start relying upon this as a serialization format.
    • I actually, think that supporting serialization of certain things could be useful.2

If we needed to guarantee stability, I'm not sure I would ever feel confident proposing these changes. Furthermore, I think these changes would be useful to people, right now, even without a stability guarantee (people don't need to call grunstable_h5dump_state in the mainline version of their simulation code -- there is no need to create these dumps outside of debugging purposes).

Thus, my proposed compromise is to integrate these things into Grackle as unstable features. I've tried to communicate that in the docs.

To make it extra clear that grunstable_h5dump_state isn't stable, I have:

  • placed the declaration in the newly-created grackle_unstable.h public-header (a downstream application would need to explicitly include this header in addition to the regular grackle.h header
  • the names of the functions try to explicitly denote that they are currently unstable:
    • Policy for naming public API functions/structs #189 proposes that we adopt a standard convention that all new functions in the stable public API share a common prefix (gr_ or grackle_)
    • for the sake of this initial proof-of-concept I have prefixed these function names with grunstable_ under the assumption that we'll start prefixing all public components of the stable-public API with gr_
    • if we decide on a different prefix, obviously the grunstable_ prefix would need to change

Other changes

This includes a few other changes:

  • the functions used to dump the contents of the chemistry_data, code_units, and grackle_version structs to stdout (when running grackle in VERBOSE mode), have been refactored to use similar functionality to the hdf5-dumping machinery
    • I actually think this makes a lot of sense and makes maintenance a lot easier
    • One thing to note: the functions slightly changed the output format. It now writes out the information in a json format. (Originally I planned to dump json-strings to hdf5 attributes, but then I decided on something more robust). I can revert this if you would like.
  • I introduced a bunch of extra utility-functionality to pygrackle for the purposes of testing.
    • Things could be implemented a little more concisely for this particular PR. However, I was implementing variations on this functionality over-and-over. Rather than dealing with merge conflicts later, I figured it might make sense to standardize now.3

Future Ideas

It might be nice to ship Grackle with a simple C/C++ program to help with diagnosing issues. More details are provided below the fold.

More about a C/C++ diagnostic tool

In the simplest form, the program

  • would load in these dumps and then executes one of the grackle-calculation functions
  • would need to provide a way to override the grackle_data_file variable
  • if we were willing to give this program access to Grackle's internals, only a few minor tweaks would be necessary to easily reconstruct Grackle's state from the hdf5 dumps (this approach would be extremely maintainable as we add new fields and grackle-parameters)

Currently, the only way to load in the internal state is with the new python functions. This C/C++ program would be a useful addition because:

  • it would make reproducing errors more convenient
    • If you successfully compile grackle, then you basically get this for free
    • In comparison, it can be a little tricky to install pygrackle (hopefully it gets easier). Plus, a user may not want to install pygrackle on a cluster-environment
  • we could more easily reproduce all the conditions where the error first occurred
    • currently pygrackle doesn't support all features of regular grackle (hopefully this gets better)
    • if we start distributing pygrackle on PYPI or conda-forge, there is a much greater likelihood that people will build their simulation and pygrackle with different grackle versions
    • as we start adding gpu support, I imagine that getting python to play nicely with the extended libraries could be tricky...
  • it would be easier to invoke a debugger session with something like gdb

There are also some fairly cool things you could extend this tool to do:

  • you could add in the ability to load in a small slice of the data-dump
  • you could support dumping what was loaded to a new file (this could help with producing more rigorous tests of the debugging functionality AND could help cut out field data where there isn't any problem)
  • you could support a mode that searches index-by-index for the problematic data
  • you could also use this thing for performance benchmarking (since we could easily feed it arbitrary data and don't have to worry about of a python interpreter)

Footnotes

  1. The particular change can be separately reviewed as part of PR Introduce gr_initialize_field_data #205.

  2. In particular, if people serialize just the chemistry_data contents in a standardized way, we could support some cool things with yt analysis (building on existing functionality in pygrackle, or even adding some basic things directly to yt without needing pygrackle)

  3. If I'm feeling motivated, I might introduce a separate pull-request to try to introduce some of this functionality (but I probably won't get to it).

@mabruzzo mabruzzo added enhancement New feature or request proposal labels Apr 25, 2024
mabruzzo added 23 commits May 30, 2024 10:35
…emistry_data and grackle_version to a separate file.
Previously, we were dumping `chemistry_data`, `code_units`, and `grackle_version` to a json-string that was stored in the hdf5 file. While this certainly got the job done, there were 2 drawbacks:

1. we did it in a rather round-about manner...
2. if anyone ever wanted to deserialize the dumps into new instances, you would need to support hdf5 reading AND json-parsing

With this new change, we now dump these structs in a way that individual key-value pairs are stored as attributes. This simplifies hypothetical deserialization
@mabruzzo mabruzzo force-pushed the debugtool-statedump branch from 6d87b1e to 5632e76 Compare May 30, 2024 19:11
@mabruzzo mabruzzo changed the title WIP: State-Dumping Debug Tool State-Dumping Debug Tool Jun 2, 2024
@mabruzzo
Copy link
Collaborator Author

mabruzzo commented Jun 2, 2024

@brittonsmith, When you have time this is now ready for review.

Sorry this got so large.

I think this could create some minor conflicts with #177, but I'm happy to help resolve those (it should be very easy)

mabruzzo added a commit to mabruzzo/grackle that referenced this pull request Jun 3, 2024
This change was split off from GH PR grackle-project#197. This should be a lot easier
to review in isolation (and I think that it is a generally useful
change in its own right).

Overview
--------
The objective of this PR is conceptually very simple. It introduces the
following function:

```c++
int gr_initialize_field_data(grackle_field_data *my_fields);
```

The idea is to have users immediately call this function right after they
initialize a new `grackle_field_data` instance

- When it's called the handful of non data-field members are assigned
  sensible defaults. It also initializes all data-field members to NULL
  pointers.

- this is analogous to the way that `local_initialize_chemistry_parameters`
  is used.

Motivation
----------

While this function is not strictly necessary, I think it is a useful
addition and I think we should tell people to use it (though old code
won't break).

A large motivation is peace of mind. I have often wished for this sort
of thing. I'm always a little concerned when using a new configuration
of grackle that I didn't initialize all of the required fields. I
would definitely feel a little more comfortable knowing that a field
that I forgot about is assigned a ``NULL`` pointer rather than some
garbage data that might let the code chug along.

It could also facillitate more error checking:

- such as checks in the style of the ``grid_dx`` check introduced in
  PR grackle-project#190 (applying similar checks to `grid_rank`, `grid_dimension`,
  `grid_start`, and `grid_end` requires that we know their default
  value).

- we could explicitly check that the user specified all required
  data-fields. We can only do this if we know that unset data-fields
  have a default value of `NULL`.

  - **From a user friendliness perspective, I think this check alone is
    enough to justify the function's existence.**

  - We could also potentially warn if unnecessary fields were specified

  - Plus, we could account for the fact that functions like
    ``calculate_pressure`` require fewer fields that ``solve_chemistry``

- we could also add checks that none of the fields are aliases (an
  implicit Fortran requirement). This would be much easier to implement
  if we knew we could just ignore fields that were set to ``NULL``
  pointers

We would probably need to make these checks opt-in somehow, both because
we can't be absolutely certain whether a user actually invoked
``gr_initialize_field_data`` and users might not want the runtime-cost
(which could be relatively large for very small grids)

Naming
------

I choose to name this function based on my suggestion in grackle-project#189 that we
adopt a convention that all new functions in the stable public API share
a common prefix (`gr_` or `grackle_`).

For the sake of proposing something concrete, I assumed that we choose
the `gr_` prefix. (If we decide on a different prefix, this would be
easy to change).

Alternatively, if we don't want any prefix, we could name it
`initialize_grackle_field_data` (I do worry a little that could
introduce naming conflicts with existing functions in downstream codes)
mabruzzo added a commit to mabruzzo/grackle that referenced this pull request Jun 3, 2024
This change was split off from GH PR grackle-project#197. This should be a lot easier
to review in isolation (and I think that it is a generally useful
change in its own right).

Overview
--------
The objective of this PR is conceptually very simple. It introduces the
following function:

```c++
int gr_initialize_field_data(grackle_field_data *my_fields);
```

The idea is to have users immediately call this function right after they
initialize a new `grackle_field_data` instance

- When it's called the handful of non data-field members are assigned
  sensible defaults. It also initializes all data-field members to NULL
  pointers.

- this is analogous to the way that `local_initialize_chemistry_parameters`
  is used.

Motivation
----------

While this function is not strictly necessary, I think it is a useful
addition and I think we should tell people to use it (though old code
won't break).

A large motivation is peace of mind. I have often wished for this sort
of thing. I'm always a little concerned when using a new configuration
of grackle that I didn't initialize all of the required fields. I
would definitely feel a little more comfortable knowing that a field
that I forgot about is assigned a ``NULL`` pointer rather than some
garbage data that might let the code chug along.

It could also facillitate more error checking:

- such as checks in the style of the ``grid_dx`` check introduced in
  PR grackle-project#190 (applying similar checks to `grid_rank`, `grid_dimension`,
  `grid_start`, and `grid_end` requires that we know their default
  value).

- we could explicitly check that the user specified all required
  data-fields. We can only do this if we know that unset data-fields
  have a default value of `NULL`.

  - **From a user friendliness perspective, I think this check alone is
    enough to justify the function's existence.**

  - We could also potentially warn if unnecessary fields were specified

  - Plus, we could account for the fact that functions like
    ``calculate_pressure`` require fewer fields that ``solve_chemistry``

- we could also add checks that none of the fields are aliases (an
  implicit Fortran requirement). This would be much easier to implement
  if we knew we could just ignore fields that were set to ``NULL``
  pointers

We would probably need to make these checks opt-in somehow, both because
we can't be absolutely certain whether a user actually invoked
``gr_initialize_field_data`` and users might not want the runtime-cost
(which could be relatively large for very small grids)

Naming
------

I choose to name this function based on my suggestion in grackle-project#189 that we
adopt a convention that all new functions in the stable public API share
a common prefix (`gr_` or `grackle_`).

For the sake of proposing something concrete, I assumed that we choose
the `gr_` prefix. (If we decide on a different prefix, this would be
easy to change).

Alternatively, if we don't want any prefix, we could name it
`initialize_grackle_field_data` (I do worry a little that could
introduce naming conflicts with existing functions in downstream codes)
@mabruzzo mabruzzo mentioned this pull request Jun 24, 2024
1 task
@brittonsmith
Copy link
Contributor

@mabruzzo, I will try to review this when I am back from holiday in a couple weeks. If you could resolve the current merger conflicts before then, that would be great.

@mabruzzo
Copy link
Collaborator Author

I haven't quite gotten around to this yet (hopefully tomorrow or the day after)! There are a number of structural improvements that could be made. In particular #209 will simplify the API a lot!

@mabruzzo mabruzzo marked this pull request as draft July 22, 2024 00:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request proposal
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants