Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP, ENH: add POSIX access pattern plot to darshan job summary #903

Merged
merged 9 commits into from
Apr 17, 2023

Conversation

shanedsnyder
Copy link
Contributor

See commits for more details.

If things look good, I can add some tests that can at least check some expected summary_record counter values returned by acummulate_records().

A few thoughts:

  • This is the first plotting routine that takes a dataframe as input (others take a report object). I thought about this a bit, and think it would be good to generalize the plotting routines we provide away from report objects, to allow us to plot data coming from multiple log files. I didn't modify any other plotting routines, but could take a pass at this in another PR if that seems reasonable?
  • Related to above, but this new plotting routine is hardcoded to get plot data from the first row of the input dataframe for now. That works fine for the accumulator use case obviously since it's just a single row, but I'm not sure how to best generalize this. Seems like it should be easy? Ultimately, I think it would be nice if you could give the plotting routine a single row (in which case it plots the row values directly) or an entire dataframe (in which case it will need to run the accumulator internally to condense down to a single record). Any thoughts on this?
  • Can we convert the derived metrics structure returned by accumulate_records() into a more Pythonic object? summary_record is a dictionary now (i.e., not a cdata object), so seems that it would be cleaner if both return values are typical Python things, rather than derived_metrics being a C-like structure.

@shanedsnyder
Copy link
Contributor Author

Here's what a new example report looks like for now (e3sm_io_heatmap_only.darshan):
image

So, obvious things to cleanup are to make sure the legend isn't on top of plot data, and probably to add a page break between plots (or otherwise reorg things so the big table isn't jammed next to this plot).

Copy link
Collaborator

@tylerjereddy tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably to add a page break between plots

On my monitor they aren't too close together--it probably depends on screen size and stuff--probably an html/css layout that auto-adapts or something would be good, but fancier stuff there may be a can of worms (darshan reports on the phone, etc...).

image

It is good that the plot looks similar vs. old perl report on same log file:

image

Some thoughts on the new plot:

  • the 9 consecutive read counts are only visible because of the value labels added above the bar chart values--not sure if a threshold for log scaling might be warranted (I don't really care too much)--you could probably argue about the merits of emphasizing size differences vs. not noticing a non-zero entry...
  • why isn't the total category equal to the sum of consecutive and sequential (i.e., the read values here?)--if the categories aren't mutually exclusive I wonder if that is worth noting

For the other points, I'd prefer to keep the discussion focused and have issues opened to elaborate on the pros/cons:

I thought about this a bit, and think it would be good to generalize the plotting routines we provide away from report objects, to allow us to plot data coming from multiple log files. I didn't modify any other plotting routines, but could take a pass at this in another PR if that seems reasonable?

I'm perhaps less convinced of the value/priority for now--I think an issue with a clear description of concrete use cases and how the API would work for the current usages is likely in order. I believe some of our plotting routines behave quite differently, so I'm also not entirely convinced this strategy is appropriate in a super broad sense, and I'd also like the plotting functions as simple as possible--avoid doing data analysis inside the plotting function proper where possible.

If it were clear that this reduced the amount of code substantially that might be convincing, though for a infrastructure that is initially "report"/single file focused, it seems likely this would initially grow more complexity to allow generalization, and may bloat plotting functions with shims/analyses that have little to do with plotting proper.

For folks that need functionality this advanced, I think an argument might need to be made that they can't just write their own plotting code? Thinking about the heatmap plotting, my brain melts a bit re: fusing files and DataFrames from multiple sources, and supporting bugs from users related to this kind of thing. Probably other such examples, and cases that are a bit more on the fence. So, my initial thought is--case-by-case basis with strong real-world justification, and try to keep plotting code devoid of analysis complexity and instead accept the final condensed/consensus data structures?

@shanedsnyder
Copy link
Contributor Author

I'll respond more to other comments later, but wanted to start here:

I thought about this a bit, and think it would be good to generalize the plotting routines we provide away from report objects, to allow us to plot data coming from multiple log files. I didn't modify any other plotting routines, but could take a pass at this in another PR if that seems reasonable?

I'm perhaps less convinced of the value/priority for now--I think an issue with a clear description of concrete use cases and how the API would work for the current usages is likely in order. I believe some of our plotting routines behave quite differently, so I'm also not entirely convinced this strategy is appropriate in a super broad sense, and I'd also like the plotting functions as simple as possible--avoid doing data analysis inside the plotting function proper where possible.

If it were clear that this reduced the amount of code substantially that might be convincing, though for a infrastructure that is initially "report"/single file focused, it seems likely this would initially grow more complexity to allow generalization, and may bloat plotting functions with shims/analyses that have little to do with plotting proper.

Sure, we could document what each existing plot is expecting in terms of data in an issue or something. At a high-level, all they do now is take a report as input and extract all the data they need from that. I think the change I'm suggesting would be to instead have the caller of the plotting routines responsible for actually extracting the data in the format the plots expect and passing in directly, instead of having the plot routines digging around in report objects. Seems like that would actually simplify plotting code at the expense of requiring callers to be more explicit about the data they want plotted.

Of course, the main driver of this is not really code cleanup, it's just to generalize our plotting routines so that they could conceivably plot data that comes from multiple log files. As it stands, that is impossible, which will be a definite problem going forward. We really need this capability to extend Darshan to be useful in context of analyzing workflows (that generate many log files) or for analyzing tons of logs generated at a facility. I think continuing to tie our plotting infrastructure to single logs would be a mistake in that regard.

For folks that need functionality this advanced, I think an argument might need to be made that they can't just write their own plotting code? Thinking about the heatmap plotting, my brain melts a bit re: fusing files and DataFrames from multiple sources, and supporting bugs from users related to this kind of thing. Probably other such examples, and cases that are a bit more on the fence. So, my initial thought is--case-by-case basis with strong real-world justification, and try to keep plotting code devoid of analysis complexity and instead accept the final condensed/consensus data structures?

The use cases I mention above are definitely of interest to our users. I'm not sure the changes are really that challenging (we'll see, I think the plot in this PR provides some supporting evidence that this could be relatively simple) that we have to punt on this for users. The heatmaps will almost certainly be most involved, but I don't think we have to figure that out all at once. The other plots seem like they would be incredibly straightforward to switch to a dataframe or record based input, which would make them trivial to apply for single- or multi-log use cases. I don't really see the downside other than that it requires a small refactor?

@tylerjereddy
Copy link
Collaborator

Maybe just shift that stuff to an issue yeah--for code that is basically "produce a bar chart" like here, I don't really even see much value in having it public, could basically just have the aggregator public and a small example in its docstring of how you might condense and plot, freeing us from any plotting code maintenance entirely.

@tylerjereddy
Copy link
Collaborator

And if DataFrames are embraced more globally in the code base, they already have plotting methods attached to them as well, so calling DataFrame.plot.bar() and so on may very well get users close enough in many cases, just need some small docs that are tested/maintained with examples.

@shanedsnyder
Copy link
Contributor Author

shanedsnyder commented Mar 31, 2023

why isn't the total category equal to the sum of consecutive and sequential (i.e., the read values here?)--if the categories aren't mutually exclusive I wonder if that is worth noting

I can see how this could be confusing, as I was convinced there was a problem too before thinking about it more. The issue is that the "sequential" count is inclusive of the "consecutive" count (see the definitions in the figure caption). I.e., you have to subtract the consecutive count from the sequential count to get the count of operations that were only sequential.

There is a 3rd implicit category not represented which covers what is essentially the "random" access case -- in that case, the read/write operation actually seeks backward in the file. Total operations is then a sum of the random, consecutive, and only sequential operations, if that makes sense.

I'll have to think a bit more and see if any tweaks to the plot make sense or if we should at least add some clarifying text about this.

@shanedsnyder shanedsnyder force-pushed the snyder/pydarshan-posix-access-pattern branch from 5f6eec0 to a449459 Compare April 11, 2023 22:19
* allow reuse of code used to create Python representation of
  C record structures
* function now returns the derived metrics struct _and_ a summary
  record as a tuple
  - this allows us to return both pieces of data that are provided
    by the C accumulator
  - summary record formatted using `_make_generic_record()`, with
    caller optionally specifying a record dtype
* name changed to `accumulate_records()` to match more general
  purpose
  - 'log_' prefix dropped so we don't imply accumulation is tied
    to a specific log file, as most backend functions are
* updated job summary tool and tests to accomodate above changes
* plot routine takes POSIX module dataframe as input
  - for now, plot using just the first row of the dataframe
* only works for POSIX module data
* Fixes #780
@shanedsnyder shanedsnyder force-pushed the snyder/pydarshan-posix-access-pattern branch from a449459 to 43ee240 Compare April 14, 2023 16:02
@shanedsnyder
Copy link
Contributor Author

I'll have to think a bit more and see if any tweaks to the plot make sense or if we should at least add some clarifying text about this.

I ultimately just added a new figure caption to clarify that sequential is inclusive of consecutive. I'm not sure other tweaks will be much more helpful so I think I'm content with that. We could present "sequential" as being strictly sequential (i.e., does not include consecutive operations), but that also will require clarify text so not sure it's any better.

@shanedsnyder
Copy link
Contributor Author

I think I'm all done with edits here and ready for any final review comments before merging.

@tylerjereddy
Copy link
Collaborator

I'll try to take a final look on April 17th, got swamped with grants and stuff this weekend

@tylerjereddy
Copy link
Collaborator

Ok, I looked at the diff again and checked a sample log->HTML locally and it seems "ok" and still consistent with the old perl report for access pattern plotting.

Since we really do need to wrap up the feature-completeness of the new Python report I'll go ahead and merge.

I think some of the regression testing didn't really happen here re: the values we see in the new plot, but I'll leave that up to the team I guess. As far as I can tell, we've just incremented the number of expected plots, which is at least better than nothing.

@tylerjereddy tylerjereddy merged commit 122ad2c into main Apr 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants