WIP, ENH: add POSIX access pattern plot to darshan job summary #903
Conversation
probably to add a page break between plots
On my monitor they aren't too close together--it probably depends on screen size and so on--an HTML/CSS layout that auto-adapts would probably be good, but fancier stuff there may be a can of worms (Darshan reports on the phone, etc.).
It is good that the plot looks similar to the one in the old Perl report on the same log file:
Some thoughts on the new plot:
- the 9 consecutive read counts are only visible because of the value labels added above the bars--not sure if a threshold for log scaling might be warranted (I don't really care too much; see the sketch after this list)--you could probably argue about the merits of emphasizing size differences vs. not noticing a non-zero entry...
- why isn't the total category equal to the sum of consecutive and sequential (i.e., the read values here)?--if the categories aren't mutually exclusive, I wonder if that is worth noting
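Not code from this PR--just a minimal matplotlib sketch of the log-scaling idea from the first bullet, with hypothetical counter values (only the 9 is taken from the plot discussed above): `bar_label` keeps exact counts visible, while `symlog` keeps small bars from vanishing next to large ones.

```python
import matplotlib.pyplot as plt

# hypothetical read counts; only the 9 comes from the plot above
categories = ["total", "sequential", "consecutive"]
read_counts = [1621, 1519, 9]

fig, ax = plt.subplots()
bars = ax.bar(categories, read_counts)
ax.bar_label(bars)          # value labels above each bar (matplotlib >= 3.4)
ax.set_yscale("symlog")     # symlog tolerates zero-valued counters, unlike log
ax.set_ylabel("read operation count")
fig.savefig("access_pattern_sketch.png")
```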
For the other points, I'd prefer to keep the discussion focused and have issues opened to elaborate on the pros/cons:
I thought about this a bit, and think it would be good to generalize the plotting routines we provide away from report objects, to allow us to plot data coming from multiple log files. I didn't modify any other plotting routines, but could take a pass at this in another PR if that seems reasonable?
I'm perhaps less convinced of the value/priority for now--I think an issue with a clear description of concrete use cases and how the API would work for the current usages is likely in order. I believe some of our plotting routines behave quite differently, so I'm also not entirely convinced this strategy is appropriate in a super broad sense; I'd also like to keep the plotting functions as simple as possible--avoid doing data analysis inside the plotting function proper where possible.
If it were clear that this reduced the amount of code substantially, that might be convincing, though for an infrastructure that is initially "report"/single-file focused, it seems likely this would initially grow more complexity to allow generalization, and may bloat plotting functions with shims/analyses that have little to do with plotting proper.
For folks that need functionality this advanced, I think an argument might need to be made that they can't just write their own plotting code? Thinking about the heatmap plotting, my brain melts a bit re: fusing files and DataFrames from multiple sources, and supporting bugs from users related to this kind of thing. Probably other such examples, and cases that are a bit more on the fence. So, my initial thought is--case-by-case basis with strong real-world justification, and try to keep plotting code devoid of analysis complexity and instead accept the final condensed/consensus data structures?
I'll respond more to other comments later, but wanted to start here:
Sure, we could document what each existing plot is expecting in terms of data in an issue or something. At a high level, all they do now is take a report as input and extract all the data they need from that. I think the change I'm suggesting would be to instead have the caller of the plotting routines responsible for actually extracting the data in the format the plots expect and passing it in directly, instead of having the plot routines dig around in report objects. Seems like that would actually simplify plotting code at the expense of requiring callers to be more explicit about the data they want plotted. Of course, the main driver of this is not really code cleanup; it's to generalize our plotting routines so that they could conceivably plot data that comes from multiple log files. As it stands, that is impossible, which will be a definite problem going forward. We really need this capability to extend Darshan to be useful in the context of analyzing workflows (that generate many log files) or for analyzing tons of logs generated at a facility. I think continuing to tie our plotting infrastructure to single logs would be a mistake in that regard.
The use cases I mention above are definitely of interest to our users. I'm not sure the changes are really so challenging (we'll see; I think the plot in this PR provides some supporting evidence that this could be relatively simple) that we have to punt on this for users. The heatmaps will almost certainly be the most involved, but I don't think we have to figure that out all at once. The other plots seem like they would be incredibly straightforward to switch to a dataframe- or record-based input, which would make them trivial to apply to single- or multi-log use cases. I don't really see the downside, other than that it requires a small refactor?
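To make the shape of that refactor concrete, here is a hedged sketch--the function name and the counter-extraction details are assumptions, not code from this PR. The plot routine takes a dataframe, and the caller decides whether it came from one log or several:

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_read_pattern(posix_df: pd.DataFrame):
    """Bar chart of read access-pattern counters from a POSIX dataframe."""
    cols = ["POSIX_READS", "POSIX_SEQ_READS", "POSIX_CONSEC_READS"]
    fig, ax = plt.subplots()
    posix_df[cols].sum().plot.bar(ax=ax)  # summing rows handles 1..N files
    return fig

# single-log usage (accessor spelling is an assumption):
#   fig = plot_read_pattern(report.records["POSIX"].to_df()["counters"])
# multi-log usage becomes a concat, with no change to the plot routine:
#   fig = plot_read_pattern(pd.concat(per_log_dataframes))
```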
Maybe just shift that stuff to an issue, yeah--for code that is basically "produce a bar chart" like here, I don't really even see much value in having it public; we could basically just make the aggregator public and put a small example in its docstring of how you might condense and plot, freeing us from any plotting-code maintenance entirely.
And if DataFrames are embraced more globally in the code base, they already have plotting methods attached to them as well, so calling those directly may be all users need.
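As a sketch of that "public aggregator with a docstring example" idea (the function name and counter columns are hypothetical), the module could export only the condensing step and lean on pandas' built-in plotting:

```python
import pandas as pd

def aggregate_read_pattern(posix_df: pd.DataFrame) -> pd.Series:
    """Condense per-file POSIX counters into read access-pattern totals.

    Examples
    --------
    >>> totals = aggregate_read_pattern(posix_df)
    >>> totals.plot(kind="bar")  # pandas' built-in plotting
    """
    cols = ["POSIX_READS", "POSIX_SEQ_READS", "POSIX_CONSEC_READS"]
    return posix_df[cols].sum()
```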
I can see how this could be confusing, as I was convinced there was a problem too before thinking about it more. The issue is that the "sequential" count is inclusive of the "consecutive" count (see the definitions in the figure caption). I.e., you have to subtract the consecutive count from the sequential count to get the count of operations that were only sequential. There is a third implicit category not represented, which covers what is essentially the "random" access case--in that case, the read/write operation actually seeks backward in the file. Total operations is then the sum of the random, consecutive, and only-sequential operations, if that makes sense. I'll have to think a bit more and see whether any tweaks to the plot make sense or whether we should at least add some clarifying text about this.
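A quick arithmetic check of that explanation, using hypothetical counter values (only the category relationships are taken from the comment above):

```python
# "sequential" is inclusive of "consecutive"; "random" (backward-seeking)
# operations are the implicit third category
total_reads = 1621   # hypothetical
seq_reads = 1519     # offset >= end of previous op (includes consecutive)
consec_reads = 9     # offset == end of previous op exactly

only_seq = seq_reads - consec_reads      # sequential but not consecutive
random_reads = total_reads - seq_reads   # sought backward in the file

# total is the sum of the three mutually exclusive categories
assert total_reads == random_reads + only_seq + consec_reads
```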
* allow reuse of code used to create Python representation of C record structures
* function now returns the derived metrics struct _and_ a summary record as a tuple
  - this allows us to return both pieces of data that are provided by the C accumulator
  - summary record is formatted using `_make_generic_record()`, with the caller optionally specifying a record dtype
* name changed to `accumulate_records()` to match its more general purpose
  - 'log_' prefix dropped so we don't imply accumulation is tied to a specific log file, as most backend functions are
* updated job summary tool and tests to accommodate above changes
* plot routine takes POSIX module dataframe as input
  - for now, plot using just the first row of the dataframe
* only works for POSIX module data
* Fixes #780
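A hedged sketch of how the reworked accumulator interface described in these commits might be called--the import path, argument order, and metadata accessors are assumptions, not verified against this branch:

```python
import darshan
from darshan.backend.cffi_backend import accumulate_records  # path assumed

report = darshan.DarshanReport("example.darshan", read_all=True)
records = report.records["POSIX"]
nprocs = report.metadata["job"]["nprocs"]  # accessor spelling assumed

# per the commit message, both accumulator outputs come back as a tuple
derived_metrics, summary_record = accumulate_records(records, "POSIX", nprocs)

# summary_record is a plain dict shaped by _make_generic_record(), while
# derived_metrics is still a C-like struct (see the discussion below)
```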
I ultimately just added a new figure caption to clarify that sequential is inclusive of consecutive. I'm not sure other tweaks would be much more helpful, so I think I'm content with that. We could present "sequential" as strictly sequential (i.e., not including consecutive operations), but that would also require clarifying text, so I'm not sure it's any better.
I think I'm all done with edits here and ready for any final review comments before merging.
I'll try to take a final look on April 17th; got swamped with grants and stuff this weekend.
Ok, I looked at the diff again and checked a sample log->HTML locally, and it seems "ok" and still consistent with the old Perl report. Since we really do need to wrap up the feature-completeness of the new Python report, I'll go ahead and merge. I think some of the regression testing didn't really happen here re: the values we see in the new plot, but I'll leave that up to the team, I guess. As far as I can tell, we've just incremented the number of expected plots, which is at least better than nothing.
See commits for more details.
If things look good, I can add some tests that at least check some expected `summary_record` counter values returned by `accumulate_records()`. A few thoughts:
* should we convert the `derived_metrics` returned by `accumulate_records()` into a more Pythonic object? `summary_record` is a dictionary now (i.e., not a cdata object), so it seems that it would be cleaner if both return values were typical Python things, rather than `derived_metrics` being a C-like structure.
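One possible reading of "more Pythonic": copy the cdata struct members into a plain dataclass before returning. A minimal sketch, with field names based on the C accumulator struct--treat both the names and the helper as hypothetical:

```python
from dataclasses import dataclass

@dataclass
class DerivedMetrics:
    # field names are assumptions; mirror whatever the C struct exposes
    total_bytes: int
    agg_time_by_slowest: float
    agg_perf_by_slowest: float

def pythonify_derived_metrics(cdata) -> DerivedMetrics:
    # cffi cdata exposes struct members as attributes; copying them out
    # detaches the result from the underlying C memory entirely
    return DerivedMetrics(
        total_bytes=int(cdata.total_bytes),
        agg_time_by_slowest=float(cdata.agg_time_by_slowest),
        agg_perf_by_slowest=float(cdata.agg_perf_by_slowest),
    )
```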