Discussion: PyDarshan and libdarshan-util interactions, accumulator interface #888
Replies: 7 comments · 6 replies
-
For my 2c, I think I'd probably rule out Solution 1 entirely, but included it for discussion purposes. I would probably prefer Solution 3 as the immediate path forward (and I could provide more details and/or start hacking on a PR there if it's helpful to consider some code), but am open to discussion. I think I prefer Solution 3 because it allows us to offer accumulation as a first-class feature of the DarshanReport interface and doesn't rely on standing up additional (unnecessary) logic for converting back from Python data representations, as in Solution 2.
-
On my side, I think only Solution 1 makes sense long-term, and I'm pretty strongly opposed to the other approaches long-term, because it is just an absolute disaster to deal with the error handling/memory management at the C level as a developer, and because we really should start to try to shed our dependence on the in-house binary format idiosyncrasies as much as possible. I think the C library code should really serve just one purpose: get the data out of the custom binary Darshan data format as quickly as possible and into the rich Python data analysis ecosystem as soon as possible. We should just be using DataFrames to store the data, and interchange with C/C++ analysis codes should just happen through the standard Python ecosystem buffer interfaces used in so many other places.

CFFI should probably be removed in favor of Cython if the C developers want to continue writing at the C layer, so that we can directly pass NumPy arrays/memoryviews to/from C as needed, but I'd anticipate that most sane analyses should happen using pandas/SQL-style operations on the rows and columns of data. CFFI may have been a tempting short-term patch, but the amount of time I've spent dealing with memory leaks and segfaults through the opaque CFFI interface has been non-trivial.

To achieve this, I think objective 2 is fine in the short term, so the reports can be made to match the Perl reports and an overhaul doesn't interfere with sunsetting the Perl reports. That said, once objective 2 is in place, I think we should write thorough Python tests for the aggregation functionality over time, until we reach a point where we can sunset usage of the C bindings for that. I don't really care if that takes a few years, for example.
Most importantly, and likely controversially, I don't think we should ever write raw C code that operates on the record structs for analysis again (no new code there). If the aggregation stuff stays around for a few years to avoid rewriting, that's "fine for now" I think, but it is just a poor use of resources to neglect the power of the ecosystem unlocked by NumPy/pandas. While I know (or hope) you're not proposing to write your own ML code in raw C at this point, I do think that whatever you do in this particular case should be the last case of supporting analysis code written in C that operates on the in-house record formats; future analysis should be entirely confined to operating on Python-exposed data structures and Cython-level stuff on standard buffer formats if you really need C (but that should be rare). I also think it is not really fair to characterize this as needing to write the Python algorithms three times for NumPy,
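To make the DataFrame argument concrete, here is a minimal sketch of the kind of column-wise reduction pandas gives for free once records are out of the C structs. The column names here are invented for illustration (they just mimic Darshan's counter naming conventions) and this is not real PyDarshan code:

```python
import pandas as pd

# Hypothetical per-rank POSIX-style records as a DataFrame; the counter
# names below are assumptions for illustration, not a real schema.
records = pd.DataFrame({
    "rank": [0, 1, 1],
    "POSIX_OPENS": [2, 4, 1],
    "POSIX_BYTES_READ": [1024, 4096, 512],
})

# Summing counters across all ranks is a one-line column reduction,
# with no per-struct C iteration or manual memory management.
totals = records.drop(columns="rank").sum()
print(int(totals["POSIX_OPENS"]))  # 7
```

The point is that once the data is tabular, whole classes of aggregation become single pandas expressions rather than hand-written C loops.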
-
A fourth option for discussion. (I'm not sure this is a good idea either, but it seems like it warrants discussion.) For some motivating context: conceptually it doesn't seem like it should be that hard for us to mix and match C and Python in cases where one or the other is more appropriate for some reason, or even mix and match other toolkits for that matter. From that perspective, maybe the problem is that we are using programmatic APIs as the interface rather than an agnostic data format and schema that would let different things access/manipulate/exchange Darshan data without so much direct coupling and translation. The current in-memory format is always an array of sometimes-large C structs. Analysis code in any language therefore has to interpret C structs and be able to invoke native C functions that don't necessarily match its memory model. The most popular language-agnostic format for this kind of thing (to my knowledge) would be Apache Arrow (https://arrow.apache.org/). The description on their page is better than I can paraphrase. It has two types of C bindings: one that just gives you a fairly direct interpretation of the data (https://arrow.apache.org/docs/format/CDataInterface.html) and can be embedded in a code base like libdarshan-util, and one that is an external dependency with a far broader scope (https://arrow.apache.org/docs/c_glib/). The Python side has all of the trappings that you would expect, including Python-native conversion to dataframes (https://arrow.apache.org/docs/python/pandas.html). We'd have to play with it to see if it does what it says on the tin, but adopting a language-neutral in-memory data format might make it easier to use capabilities in both languages a little more easily. The C library could possibly emit and/or ingest Arrow-formatted in-memory data on the fly, or we could consider a bulk conversion to a persistent file in the Parquet file format (https://parquet.apache.org/).
-
Also, I agree that we need something short-term based on methods we have that already (almost) work, but while this is on our minds it seems like a good time to think about long-term strategy too.
-
Solution 1 is predicated on some pretty substantial changes, as it basically requires rewriting nearly all of the darshan-util components (sans log reading/writing, by my understanding) in Python -- otherwise we find ourselves maintaining both traditional C tools and new PyDarshan stuff. That's obviously impractical in the short term, maybe even in the longer term when considering funding/manpower. Even so, what exactly does the aggregation code look like at the Python level? As of now, aggregation code can be optionally implemented when defining a module's logutil handlers in the darshan-util C library, which makes it an obvious part of a module's interface with Darshan's libraries. I won't argue the code is pretty or anything (see darshan-util/darshan-stdio-logutils.c, line 393, at commit b9043e1), but I really can't imagine how the pandas code would be any simpler. As a comparison, consider some of the example aggregation code in PyDarshan: https://github.com/darshan-hpc/darshan/blob/main/darshan-util/pydarshan/darshan/experimental/aggregators/agg_ioops.py. This code is not comprehensive; it only aggregates the portions of the record that are really straightforward sums (operation counts, access size histograms). This pandas code will require detailed knowledge of the C representation of instrumentation records (different counter types use different aggregation operations), which I think also just creates more disconnect, as that record representation is maintained in the C library. It just sounds like a mess, but perhaps I could be open-minded if I knew exactly what this Python code would look like. Maybe just moving forward with Solution 2 in the short term is the least controversial way to get to a functional job summary tool, and gives us more time to come up with a longer-term solution?
It offers us the most flexibility to aggregate basically any records together now, with perhaps something like @carns' suggestion for Solution 4 ultimately helping define a common memory format for darshan-util/PyDarshan. If we do eventually PyDarshan all of the things, then we will still need some solution for memory interoperability between Python and C, as the log writing functionality will require that. In the meantime, Solution 2 can probably provide the stopgap.
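For what it's worth, a hedged sketch of how the "different counter types use different aggregation operations" concern might be expressed in pandas: a per-counter rule table, which is essentially the knowledge the C logutil handlers encode today. The counter names and rules below are illustrative assumptions modeled on Darshan's conventions, not a real implementation:

```python
import pandas as pd

# Two hypothetical STDIO records for the same file from different ranks;
# names/semantics are assumptions mimicking Darshan counter conventions.
recs = pd.DataFrame({
    "STDIO_OPENS": [3, 2],                       # additive counter
    "STDIO_MAX_BYTE_WRITTEN": [100, 250],        # high-water mark
    "STDIO_F_OPEN_START_TIMESTAMP": [1.5, 0.7],  # earliest event time
})

# Per-counter aggregation rules: sum additive counters, max the
# high-water marks, min the start timestamps.
agg_rules = {
    "STDIO_OPENS": "sum",
    "STDIO_MAX_BYTE_WRITTEN": "max",
    "STDIO_F_OPEN_START_TIMESTAMP": "min",
}
combined = recs.agg(agg_rules)
print(int(combined["STDIO_OPENS"]))  # 5
```

This doesn't resolve the disconnect concern (the rule table still has to be kept in sync with the C record definitions), but it shows the per-counter dispatch can at least be declarative on the Python side.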
-
Similar to what came through in some of the discussions already, I would also consider a hybrid solution as far as the whole of PyDarshan is concerned -- but, actually, also reduce and eventually drop the dependency on any "in-house" C for aggregations, and instead think about how to allow querying/lazy loading of logs without running into double-counting issues on the Python/pandas/Arrow/whatever-aggregation-solution side.
-
Thanks for the comments, all. I think there's a lot of good discussion here, in terms of shortcomings of what we have now and different short- and long-term plans to improve things. Acknowledging that it's a shorter-term plan and not ideal, it sounds like we may at least have enough agreement to just follow route 2 from my original discussion comment, where we use glue code Tyler has prototyped to convert back from the dataframe representation into the raw buffers expected by the C library? That will probably give us the quickest path to all the functionality we need for the comprehensive Darshan job reports we have been working towards, even if it's an inefficient/quirky process. Longer term, I completely agree with implementing new analysis capabilities in Python going forward, and potentially just moving all Darshan utilities to Python if it seems tenable. I think we are starting to come around to wanting to experiment with things like Arrow/Parquet to see if they can help alleviate some of the quirks of working with Darshan logs/interfaces -- that's something we could investigate in parallel with further development/refinement of the PyDarshan analysis tools, or maybe even something some summer students could be interested in working on. Definitely worth looking into.
-
Problem
The PyDarshan CFFI bindings convert the record pointers returned by the libdarshan-util C library into Pythonic data structures (dictionaries, NumPy arrays, pandas dataframes) that cannot be easily passed back into other libdarshan-util routines. That opens up questions about how to generally handle a couple of things in PyDarshan:
Aggregation Use Cases
- the --file-list option from darshan-parser, which provided a single record for every single file accessed by the app
- --file-list style aggregation across multiple Darshan logs from a workflow

Solutions:
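As a concrete illustration of the second use case above, a minimal pandas sketch (with invented record IDs and columns, purely for illustration) of combining per-file records from multiple logs into a single --file-list style view:

```python
import pandas as pd

# Per-file records from two hypothetical logs of the same workflow;
# "record_id" stands in for Darshan's per-file record identifier.
log_a = pd.DataFrame({"record_id": [1, 2], "bytes_read": [100, 50]})
log_b = pd.DataFrame({"record_id": [2, 3], "bytes_read": [25, 10]})

# One aggregated row per file across both logs: files seen in multiple
# logs (record_id 2 here) collapse into a single combined record.
file_list = (
    pd.concat([log_a, log_b], ignore_index=True)
      .groupby("record_id", as_index=False)
      .sum()
)
print(len(file_list))  # 3
```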