WIP, ENH: add POSIX access pattern plot to darshan job summary #903

shanedsnyder · 2023-02-22T20:55:33Z

See commits for more details.

If things look good, I can add some tests that at least check some expected summary_record counter values returned by accumulate_records().

A few thoughts:

This is the first plotting routine that takes a dataframe as input (others take a report object). I thought about this a bit, and think it would be good to generalize the plotting routines we provide away from report objects, to allow us to plot data coming from multiple log files. I didn't modify any other plotting routines, but could take a pass at this in another PR if that seems reasonable?
Related to above, but this new plotting routine is hardcoded to get plot data from the first row of the input dataframe for now. That works fine for the accumulator use case obviously since it's just a single row, but I'm not sure how to best generalize this. Seems like it should be easy? Ultimately, I think it would be nice if you could give the plotting routine a single row (in which case it plots the row values directly) or an entire dataframe (in which case it will need to run the accumulator internally to condense down to a single record). Any thoughts on this?
Can we convert the derived metrics structure returned by accumulate_records() into a more Pythonic object? summary_record is a dictionary now (i.e., not a cdata object), so seems that it would be cleaner if both return values are typical Python things, rather than derived_metrics being a C-like structure.
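A minimal sketch of the "single row or whole dataframe" idea from the second point, assuming pandas is available. The helper name `to_single_row` is hypothetical, and the plain column sum is only a stand-in for the real accumulator, which handles counters that cannot simply be added together.

```python
import pandas as pd


def to_single_row(data):
    """Return one row of counters from either a Series or a DataFrame.

    Hypothetical helper: summing the rows is a placeholder for running
    the accumulator, not a description of what the accumulator does.
    """
    if isinstance(data, pd.Series):
        # A single row was passed in, so use its values directly.
        return data
    # Several rows were passed in, so condense them to one row first.
    return data.sum(numeric_only=True)


# Made-up counter values for two records.
rows = pd.DataFrame({"POSIX_READS": [10, 5], "POSIX_WRITES": [2, 3]})

single = to_single_row(rows.iloc[0])
combined = to_single_row(rows)
```

A plotting routine built on a helper like this would only ever see one row, which keeps the condensing step separate from the drawing code.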

shanedsnyder · 2023-02-22T20:59:02Z

Here's what a new example report looks like for now (e3sm_io_heatmap_only.darshan):

So, obvious things to clean up are to make sure the legend isn't on top of plot data, and probably to add a page break between plots (or otherwise reorg things so the big table isn't jammed next to this plot).

tylerjereddy

probably to add a page break between plots

On my monitor they aren't too close together--it probably depends on screen size and stuff--probably an html/css layout that auto-adapts or something would be good, but fancier stuff there may be a can of worms (darshan reports on the phone, etc...).

It is good that the plot looks similar vs. old perl report on same log file:

Some thoughts on the new plot:

the 9 consecutive read counts are only visible because of the value labels added above the bar chart values--not sure if a threshold for log scaling might be warranted (I don't really care too much)--you could probably argue about the merits of emphasizing size differences vs. not noticing a non-zero entry...
why isn't the total category equal to the sum of consecutive and sequential (i.e., the read values here?)--if the categories aren't mutually exclusive I wonder if that is worth noting

For the other points, I'd prefer to keep the discussion focused and have issues opened to elaborate on the pros/cons:

I thought about this a bit, and think it would be good to generalize the plotting routines we provide away from report objects, to allow us to plot data coming from multiple log files. I didn't modify any other plotting routines, but could take a pass at this in another PR if that seems reasonable?

I'm perhaps less convinced of the value/priority for now--I think an issue with a clear description of concrete use cases and how the API would work for the current usages is likely in order. I believe some of our plotting routines behave quite differently, so I'm also not entirely convinced this strategy is appropriate in a super broad sense, and I'd also like the plotting functions as simple as possible--avoid doing data analysis inside the plotting function proper where possible.

If it were clear that this reduced the amount of code substantially that might be convincing, though for an infrastructure that is initially "report"/single file focused, it seems likely this would initially grow more complexity to allow generalization, and may bloat plotting functions with shims/analyses that have little to do with plotting proper.

For folks that need functionality this advanced, I think an argument might need to be made that they can't just write their own plotting code? Thinking about the heatmap plotting, my brain melts a bit re: fusing files and DataFrames from multiple sources, and supporting bugs from users related to this kind of thing. Probably other such examples, and cases that are a bit more on the fence. So, my initial thought is--case-by-case basis with strong real-world justification, and try to keep plotting code devoid of analysis complexity and instead accept the final condensed/consensus data structures?

darshan-util/pydarshan/darshan/backend/cffi_backend.py

darshan-util/pydarshan/darshan/cli/summary.py

darshan-util/pydarshan/darshan/experimental/plots/plot_posix_access_pattern.py

darshan-util/pydarshan/darshan/tests/test_lib_accum.py

shanedsnyder · 2023-02-23T16:16:06Z

I'll respond more to other comments later, but wanted to start here:

I thought about this a bit, and think it would be good to generalize the plotting routines we provide away from report objects, to allow us to plot data coming from multiple log files. I didn't modify any other plotting routines, but could take a pass at this in another PR if that seems reasonable?

I'm perhaps less convinced of the value/priority for now--I think an issue with a clear description of concrete use cases and how the API would work for the current usages is likely in order. I believe some of our plotting routines behave quite differently, so I'm also not entirely convinced this strategy is appropriate in a super broad sense, and I'd also like the plotting functions as simple as possible--avoid doing data analysis inside the plotting function proper where possible.

If it were clear that this reduced the amount of code substantially that might be convincing, though for an infrastructure that is initially "report"/single file focused, it seems likely this would initially grow more complexity to allow generalization, and may bloat plotting functions with shims/analyses that have little to do with plotting proper.

Sure, we could document what each existing plot is expecting in terms of data in an issue or something. At a high-level, all they do now is take a report as input and extract all the data they need from that. I think the change I'm suggesting would be to instead have the caller of the plotting routines responsible for actually extracting the data in the format the plots expect and passing in directly, instead of having the plot routines digging around in report objects. Seems like that would actually simplify plotting code at the expense of requiring callers to be more explicit about the data they want plotted.

Of course, the main driver of this is not really code cleanup, it's just to generalize our plotting routines so that they could conceivably plot data that comes from multiple log files. As it stands, that is impossible, which will be a definite problem going forward. We really need this capability to extend Darshan to be useful in context of analyzing workflows (that generate many log files) or for analyzing tons of logs generated at a facility. I think continuing to tie our plotting infrastructure to single logs would be a mistake in that regard.

For folks that need functionality this advanced, I think an argument might need to be made that they can't just write their own plotting code? Thinking about the heatmap plotting, my brain melts a bit re: fusing files and DataFrames from multiple sources, and supporting bugs from users related to this kind of thing. Probably other such examples, and cases that are a bit more on the fence. So, my initial thought is--case-by-case basis with strong real-world justification, and try to keep plotting code devoid of analysis complexity and instead accept the final condensed/consensus data structures?

The use cases I mention above are definitely of interest to our users. I'm not sure the changes are really that challenging (we'll see, I think the plot in this PR provides some supporting evidence that this could be relatively simple) that we have to punt on this for users. The heatmaps will almost certainly be most involved, but I don't think we have to figure that out all at once. The other plots seem like they would be incredibly straightforward to switch to a dataframe or record based input, which would make them trivial to apply for single- or multi-log use cases. I don't really see the downside other than that it requires a small refactor?

tylerjereddy · 2023-02-23T16:41:56Z

Maybe just shift that stuff to an issue yeah--for code that is basically "produce a bar chart" like here, I don't really even see much value in having it public, could basically just have the aggregator public and a small example in its docstring of how you might condense and plot, freeing us from any plotting code maintenance entirely.

tylerjereddy · 2023-02-23T16:54:55Z

And if DataFrames are embraced more globally in the code base, they already have plotting methods attached to them as well, so calling DataFrame.plot.bar() and so on may very well get users close enough in many cases, just need some small docs that are tested/maintained with examples.
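For illustration, a small sketch of that route using pandas' built-in `DataFrame.plot.bar()`. The counts are made up rather than taken from any log, and the log-scaled y axis is just one possible choice for keeping small counts visible.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import pandas as pd

# Made-up access pattern counts; real values would come from a condensed record.
counts = pd.DataFrame(
    {"read": [100, 90, 9], "write": [50, 40, 30]},
    index=["total", "sequential", "consecutive"],
)

# One group of bars per category, one bar per operation type.
ax = counts.plot.bar(rot=0, logy=True)
ax.set_ylabel("operation count")
```

From here `ax.figure.savefig(...)` would write the chart to disk.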

shanedsnyder · 2023-03-31T20:06:43Z

why isn't the total category equal to the sum of consecutive and sequential (i.e., the read values here?)--if the categories aren't mutually exclusive I wonder if that is worth noting

I can see how this could be confusing, as I was convinced there was a problem too before thinking about it more. The issue is that the "sequential" count is inclusive of the "consecutive" count (see the definitions in the figure caption). I.e., you have to subtract the consecutive count from the sequential count to get the count of operations that were only sequential.

There is a 3rd implicit category not represented which covers what is essentially the "random" access case -- in that case, the read/write operation actually seeks backward in the file. Total operations is then a sum of the random, consecutive, and only sequential operations, if that makes sense.
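The arithmetic can be sketched as follows, with made-up counts. The function and key names are hypothetical, and "random" is used loosely for whatever is left once sequential operations are subtracted from the total.

```python
def split_access_counts(total, sequential, consecutive):
    """Split an operation total into mutually exclusive categories.

    Assumes the sequential count already includes the consecutive count,
    as described above.
    """
    only_sequential = sequential - consecutive
    random = total - sequential
    return {
        "consecutive": consecutive,
        "only_sequential": only_sequential,
        "random": random,
    }


# Made-up counts, not taken from any log file.
breakdown = split_access_counts(total=100, sequential=90, consecutive=9)

# The three exclusive categories add back up to the total.
assert sum(breakdown.values()) == 100
```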

I'll have to think a bit more and see if any tweaks to the plot make sense or if we should at least add some clarifying text about this.

* allow reuse of code used to create Python representation of C record structures

* function now returns the derived metrics struct _and_ a summary record as a tuple
    - this allows us to return both pieces of data that are provided by the C accumulator
    - summary record formatted using `_make_generic_record()`, with caller optionally specifying a record dtype
* name changed to `accumulate_records()` to match more general purpose
    - 'log_' prefix dropped so we don't imply accumulation is tied to a specific log file, as most backend functions are
* updated job summary tool and tests to accommodate above changes

* plot routine takes POSIX module dataframe as input
    - for now, plot using just the first row of the dataframe
* only works for POSIX module data
* Fixes #780
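As a rough illustration of returning both pieces of data together, a two-element return value can be given named fields with `collections.namedtuple`. Everything here is hypothetical: the type name, the stub function, and the values are placeholders, not the backend's actual definitions.

```python
from collections import namedtuple

# Hypothetical container for the two values returned together.
AccumulatedRecords = namedtuple(
    "AccumulatedRecords", ["derived_metrics", "summary_record"]
)


def accumulate_stub():
    """Stand-in for an accumulator call, returning placeholder data."""
    return AccumulatedRecords(
        derived_metrics={"total_bytes": 4096},
        summary_record={"POSIX_READS": 100},
    )


result = accumulate_stub()

# Fields can be read by name, or unpacked like an ordinary tuple.
metrics, summary = result
```

Callers that already unpack a plain tuple keep working, while new code can use the field names.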

shanedsnyder · 2023-04-14T16:19:37Z

I'll have to think a bit more and see if any tweaks to the plot make sense or if we should at least add some clarifying text about this.

I ultimately just added a new figure caption to clarify that sequential is inclusive of consecutive. I'm not sure other tweaks will be much more helpful so I think I'm content with that. We could present "sequential" as being strictly sequential (i.e., does not include consecutive operations), but that would also require clarifying text, so I'm not sure it's any better.

shanedsnyder · 2023-04-14T16:20:02Z

I think I'm all done with edits here and ready for any final review comments before merging.

tylerjereddy · 2023-04-16T23:52:21Z

I'll try to take a final look on April 17th, got swamped with grants and stuff this weekend

tylerjereddy · 2023-04-17T23:26:25Z

Ok, I looked at the diff again and checked a sample log->HTML locally and it seems "ok" and still consistent with the old perl report for access pattern plotting.

Since we really do need to wrap up the feature-completeness of the new Python report I'll go ahead and merge.

I think some of the regression testing didn't really happen here re: the values we see in the new plot, but I'll leave that up to the team I guess. As far as I can tell, we've just incremented the number of expected plots, which is at least better than nothing.

shanedsnyder requested a review from tylerjereddy February 22, 2023 20:55

github-actions bot added the pydarshan label Feb 22, 2023

tylerjereddy reviewed Feb 23, 2023

shanedsnyder force-pushed the snyder/pydarshan-posix-access-pattern branch from 5f6eec0 to a449459 Compare April 11, 2023 22:19

shanedsnyder added 8 commits April 14, 2023 09:41

add _make_generic_record() helper to backend

405eb57

* allow reuse of code used to create Python representation of C record structures

add new POSIX access pattern plot to summary

e8b68c4

* plot routine takes POSIX module dataframe as input
    - for now, plot using just the first row of the dataframe
* only works for POSIX module data
* Fixes #780

remove right/top axis spines, rotate annotation 45

1d8d7c0

move legend outside of plot region

90e4d27

remove dtype argument to accumulate_records()

38ae7d7

use namedtuple return type for accumulate_records

8411c70

remove confusing comment

43ee240

shanedsnyder force-pushed the snyder/pydarshan-posix-access-pattern branch from a449459 to 43ee240 Compare April 14, 2023 16:02

updated figure comment for clarity

629a5c5

tylerjereddy approved these changes Apr 17, 2023

tylerjereddy merged commit 122ad2c into main Apr 17, 2023
