generalizing PyDarshan plotting routines for inter- and intra-log analysis use cases #926

shanedsnyder · 2023-04-17T19:24:29Z

shanedsnyder
Apr 17, 2023
Maintainer

Background

Current PyDarshan plotting routines take a Darshan report object as input. A report object provides a convenient class for managing data associated with a single Darshan log using PyDarshan's lower-level backend interface. So, existing plotting capabilities are ultimately restricted to looking at all records in a single report (e.g., like we do in the job summary tool).

However, it would be nice if we could reuse some of the existing PyDarshan analysis capabilities in other contexts beyond what's currently supported by the report interface:

1. Analysis of data across multiple Darshan log files:
   - Workflow systems composed of independent tasks that generate their own Darshan logs
   - Aggregate analysis of Darshan logs collected at facilities or by other Darshan deployments
2. Analysis of specific records within a log file:
   - Specific file of interest, all '.h5' files, etc.

Proposed solution

To help address this, I propose we modify some of our existing plotting routines to accept more general input than a report, allowing us to support the use cases above. Specifically, I think most of our plotting routines could be updated to accept a single Darshan "record" as input. This would generally allow plotting of:

1. Darshan data corresponding to a single file record of interest in a single log (rather than all file records, as is currently done in the PyDarshan job summary tool).
2. Using the output of the PyDarshan accumulator interface, Darshan data corresponding to arbitrary records that have been accumulated into a single "summary record".
   - We are not currently using output from Darshan's accumulator for generating op count plots, access size histograms, etc. in the job summary tool, but this change would make that trivial.
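
As a rough illustration of what record-based input could look like (a sketch only, not PyDarshan's actual API): below, a record is modeled as a plain dict of counters, `accumulate` is a hypothetical stand-in for the accumulator that simply sums counters, and `op_counts` is a hypothetical helper that prepares plot data. The real accumulator does more than sum counters, so this only pictures the shape of the interface.

```python
from collections import Counter

# Hypothetical stand-in for Darshan file records: a name plus a dict of counters.
# Counter names mirror POSIX module counters; the values are made up.
records = [
    {"name": "/data/a.h5",
     "counters": {"POSIX_READS": 10, "POSIX_WRITES": 2, "POSIX_OPENS": 1}},
    {"name": "/data/b.h5",
     "counters": {"POSIX_READS": 5, "POSIX_WRITES": 0, "POSIX_OPENS": 3}},
]


def accumulate(recs):
    """Hypothetical accumulator: sum counters into one 'summary record'."""
    total = Counter()
    for rec in recs:
        total.update(rec["counters"])
    return {"name": "<summary>", "counters": dict(total)}


def op_counts(record):
    """Plot input for an op count chart; accepts ANY single record,
    whether it describes one file or an accumulated summary."""
    c = record["counters"]
    return {
        "read": c.get("POSIX_READS", 0),
        "write": c.get("POSIX_WRITES", 0),
        "open": c.get("POSIX_OPENS", 0),
    }


single = op_counts(records[0])             # one file of interest
summary = op_counts(accumulate(records))   # arbitrary records, accumulated
```

The point of the sketch is that `op_counts` never needs to know whether its input came from one file, one log, or many logs.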

(An alternative solution would be to try to generalize the report interface to support data from an arbitrary number of logs, but that seems like a much more involved and complicated refactor than what's described here.)

Envisioned usage in the Darshan job summary tool

The envisioned workflow in the job summary tool is actually pretty simple:

1. Open a report object for a given Darshan log file, reading all records in the log file.
2. For each module that supports accumulation, generate accumulated output formats (derived metrics + summary record).
3. Generate different plots for each module using data from step 2.
   - I/O cost plots, operation count plots, access histograms, and common access tables could all simply use the summary record from the accumulator rather than summarizing data in the report object directly.
   - File count summary table and performance estimates can continue using derived metrics directly.
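
The three steps above could be pictured roughly as follows. This is a minimal sketch under loud assumptions: the `report` dict, `ACCUMULATING_MODULES`, `accumulate`, and `build_plot_inputs` are all hypothetical names for illustration, not PyDarshan API, and the set of modules that support accumulation is assumed.

```python
# Step 1 stand-in: a "report" holding all records from one log, grouped by module.
report = {
    "POSIX": [{"POSIX_READS": 4, "POSIX_WRITES": 1},
              {"POSIX_READS": 6, "POSIX_WRITES": 3}],
    "MPI-IO": [{"MPIIO_INDEP_READS": 2, "MPIIO_INDEP_WRITES": 2}],
    "LUSTRE": [{"LUSTRE_STRIPE_WIDTH": 4}],
}

# Assumption for this sketch: only these modules support accumulation.
ACCUMULATING_MODULES = ("POSIX", "MPI-IO", "STDIO")


def accumulate(recs):
    """Step 2 stand-in: return (derived_metrics, summary_record).
    Summing counters is a simplification of what a real accumulator does."""
    summary = {}
    for rec in recs:
        for key, val in rec.items():
            summary[key] = summary.get(key, 0) + val
    derived = {"file_count": len(recs)}
    return derived, summary


def build_plot_inputs(report):
    """Step 3 stand-in: collect per-module inputs for the plotting code."""
    inputs = {}
    for mod in ACCUMULATING_MODULES:
        if mod not in report:
            continue  # module not present in this log
        derived, summary = accumulate(report[mod])
        inputs[mod] = {
            # op counts, access histograms, I/O cost, common access tables
            "summary_record": summary,
            # file count summary table, performance estimates
            "derived_metrics": derived,
        }
    return inputs


plot_inputs = build_plot_inputs(report)
```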

Heatmaps and the "File access by category" plots are more complicated, so I think we could just leave them alone for now. Conceptually, it's probably not that hard to extend "file access by category" to ingest data from multiple logs, but that might be easier to sort out once we're more committed to a single internal representation of record data, which is still an ongoing conversation.

shanedsnyder · 2023-04-17T19:25:48Z

shanedsnyder
Apr 17, 2023
Maintainer Author

I'll add a follow-up comment soon with more specifics on what changes are needed for the plots I mentioned above that are in scope for these changes (I/O cost plots, operation count plots, access histograms, and common access tables).


tylerjereddy · 2023-04-17T23:53:47Z

tylerjereddy
Apr 17, 2023
Collaborator

I don't think I have any major issue with the general idea, though I think I'd prefer for a bunch of prototyping to just happen in some feature branches for a while, and for testing to happen locally with your facilities folks or whoever the consumers are. Then, when those routines mature, targeted PRs for generalization of individual plotting routines can be made. If folks are genuinely keen to receive the feature, they should be happy to provide timely and effective design feedback for what they need.

That reduces the burden on core reviewers when you're in early testing stages and trying to get feedback from the folks who want this, rather than merging in early and then needing to iterate many times on the routines and their regression tests because the facilities folks want adjustments, etc.

One other idea is to completely separate the two and use an abstraction--the HTML report plotting code could continue to only accept report and just pull out the records before feeding them to the common code, since that workflow should be pretty mature now. That would free up the facilities/multi-log (etc.) development to move quickly/experimentally without really disrupting the summary report API/code, and you could easily add a bunch of arguments to the new/separate plotting functions. You'll end up with higher maintenance burden/duplication though, although a separate namespace for the facilities/multi-log stuff could also have its merits, and development effort might be more clearly segregated based on what folks are interested in working on/reviewing.
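
A minimal sketch of that separation, assuming dict-based stand-ins for reports and records (all function names here are hypothetical, not existing PyDarshan functions): the report-only entry point stays a thin wrapper, while the multi-log function is free to grow extra arguments.

```python
def op_count_data(records):
    """Common code: works on any iterable of record-like dicts."""
    totals = {"reads": 0, "writes": 0}
    for rec in records:
        totals["reads"] += rec.get("POSIX_READS", 0)
        totals["writes"] += rec.get("POSIX_WRITES", 0)
    return totals


def op_count_data_from_report(report, mod="POSIX"):
    """Summary-report entry point: still accepts only a report and
    pulls the records out before calling the common code."""
    return op_count_data(report["records"][mod])


def op_count_data_multi_log(reports, mod="POSIX", min_reads=0):
    """Separate, experimental entry point for multi-log analysis.
    Extra arguments (min_reads is a made-up example) live only here."""
    records = [
        rec
        for report in reports
        for rec in report["records"].get(mod, [])
        if rec.get("POSIX_READS", 0) >= min_reads
    ]
    return op_count_data(records)


# Made-up data standing in for two Darshan logs.
log_a = {"records": {"POSIX": [{"POSIX_READS": 3, "POSIX_WRITES": 1}]}}
log_b = {"records": {"POSIX": [{"POSIX_READS": 7, "POSIX_WRITES": 0}]}}

one_log = op_count_data_from_report(log_a)
two_logs = op_count_data_multi_log([log_a, log_b])
```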

I should also point out that for objective 2 we certainly don't need this design adjustment: we could just add a kwarg to feed a regex through and filter the dataframes in the current control flow appropriately (which is basically going to have to happen somewhere with this new approach as well).
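
To make the regex idea concrete, here is a small sketch using plain dicts in place of dataframe rows. The `file_name` key and the `filter_rows` helper are hypothetical; how file names are actually attached to records in PyDarshan dataframes is not assumed here.

```python
import re

# Made-up rows standing in for per-file records in a dataframe.
rows = [
    {"file_name": "/scratch/run1/out.h5", "POSIX_BYTES_WRITTEN": 4096},
    {"file_name": "/scratch/run1/log.txt", "POSIX_BYTES_WRITTEN": 128},
    {"file_name": "/scratch/run2/out.h5", "POSIX_BYTES_WRITTEN": 8192},
]


def filter_rows(rows, pattern=None):
    """Keep rows whose file name matches the regex; None keeps everything,
    so existing callers that pass no pattern see no behavior change."""
    if pattern is None:
        return list(rows)
    rx = re.compile(pattern)
    return [row for row in rows if rx.search(row["file_name"])]


h5_rows = filter_rows(rows, r"\.h5$")
h5_bytes = sum(row["POSIX_BYTES_WRITTEN"] for row in h5_rows)
```

With an actual pandas dataframe the equivalent filter would be a boolean mask along the lines of `df[df["file_name"].str.contains(pattern, regex=True)]`, again assuming such a column exists.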

Let's not forget that we still plan to migrate the accumulator stuff to the dataframe world eventually, and perhaps as the DataFrame API (https://github.com/data-apis/dataframe-api) matures in coming years the accumulation may use any of a variety of accelerated backends beyond vanilla pandas on one or a few threads. The current use of the term record is pretty confusing at the moment FWIW, at least for me.


tylerjereddy · 2023-04-18T02:14:54Z

tylerjereddy
Apr 18, 2023
Collaborator

One other useful thing would be to clarify, for any given plot, whether the only change you want to make is to swap report for the other data structure of "records," or whether you actually want to provide a much larger change, for example more keyword arguments or behavior changes.

I think it is much easier to review/approve the first category of changes, but if you mix in other features it starts to become more of a burden.

For testing, I'd say it would be good to think carefully and creatively about corner cases and how to handle them/test them re: mimicking what might come up in a facility setting and so on (this could be far more difficult than the actual code proper). Also, keep in mind that from a testing standpoint, an individual report is a pretty convenient entry point, and changing that may require some careful thought to avoid making a mess of things for testing harnesses.
