Make the I/O functionality use the podio provided functionality #69

Closed
tmadlener opened this issue Nov 18, 2021 · 3 comments
Labels
enhancement New feature or request

Comments

@tmadlener
Contributor

For historical reasons, the I/O implementation of standalone podio and the one present here have diverged a bit, so they now offer in principle the same but still slightly different functionality. I think the framework should use the podio facilities as far as possible, and I originally thought that this would be a somewhat mechanical but ultimately straightforward thing to do. However, I have realized that there is a bit more work involved, so I am recording my observations here. In the end, I think that changes to podio are also necessary and that it might be best to first stabilize the interfaces podio offers before we actually start to work on this here.

High level functionality differences

The following table gives an overview of where podio and k4FWCore differ in high-level (i.e. user-perceivable) functionality:

| | podio | k4FWCore |
|---|---|---|
| vector user data | UserDataCollection<T> (since v00-14, compile time limited T) | DataHandle<std::vector<T>> (dictionary limited T, may fail silently(?) on I/O). There is #25 that might impact the usefulness of this feature(?) |
| other user data | N/A | DataHandle<T> (dedicated handling of ints and floats, but in principle again ROOT dictionary limited) |

I am not entirely sure how widespread the usage of these features is throughout the Key4hep components. Hence, it is also hard to gauge whether some of the functionality could be easily removed from k4FWCore (e.g. the possibility to store single int/float values per event). This is something that probably needs discussing.
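To make the "compile time limited T" entry in the table more concrete, here is a minimal C++17 sketch of how a container can restrict its element type to a fixed list at compile time. `UserDataSketch`, `isOneOf`, `isSupportedUserType` and the list of allowed types are all hypothetical and chosen only for illustration; this is not podio's actual `UserDataCollection` implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// True if T is exactly one of the types in the Allowed... list.
template <typename T, typename... Allowed>
constexpr bool isOneOf = (std::is_same_v<T, Allowed> || ...);

// Assumed list of supported element types (illustrative only).
template <typename T>
constexpr bool isSupportedUserType =
    isOneOf<T, float, double, std::int32_t, std::int64_t>;

// Hypothetical stand-in for a user data collection.
template <typename T>
class UserDataSketch {
  // Unsupported types are rejected when the template is instantiated,
  // i.e. at compile time rather than at I/O time.
  static_assert(isSupportedUserType<T>, "T is not in the supported type list");

public:
  void push_back(const T& value) { m_data.push_back(value); }
  std::size_t size() const { return m_data.size(); }
  void clear() { m_data.clear(); }

private:
  std::vector<T> m_data;
};
```

With this pattern, `UserDataSketch<float>` compiles, while something like `UserDataSketch<std::vector<int>>` fails to compile with the message from the `static_assert`. A dictionary-limited approach, by contrast, can only detect an unsupported type at run time.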

Technicalities

In k4FWCore the PodioDataSvc handles the actual reading of the collections. It holds a podio::ROOTReader and a podio::EventStore as members that do the heavy lifting in this regard. The k4DataSvc is in essence a very thin wrapper around the PodioDataSvc that exposes the filename(s) as a property to be configured from the options file. The PodioInput algorithm is responsible for actually triggering the reading of the collections (which are specified as a property) in its execute method. For this it just loops over the list of collections to read and calls PodioDataSvc::readCollection.

For writing collections there is the PodioOutput algorithm, which basically re-implements the functionality of the podio::ROOTWriter. It holds a KeepDropSwitch to control which collections to actually write to file.

In all of this there are a few subtle differences between podio and k4FWCore that make a "trivial" switch to podio facilities impossible. The following table provides a (probably incomplete) overview of them:

| | podio | k4FWCore |
|---|---|---|
| output collection handling | collections are created once via EventStore::create and then simply cleared in the event loop after writing. | collections are re-created every event and the EventStore in PodioDataSvc never gets to know about them |
| event data tree | owned by ROOTWriter, no outside access | owned by PodioDataSvc. Possible to get access to it |
| output collection writing | ROOTWriter::registerForWrite collects a list of collection names to write. Checks in EventStore if collections are actually available before adding them to the list. In event loop simply take this list and write (i.e. set branch addresses) and fill the event data tree. | In every call to PodioOutput::execute get the complete list of collections from PodioDataSvc and check via the KeepDropSwitch which collections to write, before setting branch addresses and filling the event data tree. |
| branch creation for user data | UserDataCollections are handled the same as other collections | DataHandle creates necessary branches as it also has access to the PodioDataSvc (and the event data tree therein). The DataHandle also makes sure to do the proper branch address re-setting. |
| file level meta data | N/A | PodioOutput writes the options file config into a separate branch of the meta data tree |
| I/O file formats | ROOT and SIO. (probably incomplete) abstract IReader interface for reading. Separate writer implementations (with equal interfaces) | Only ROOT, but at least for reading a switch to the IReader interface should enable reading SIO out of the box |
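The "create once, clear after writing" pattern from the podio column of the table can be sketched with mock types as follows. `StoreSketch`, `WriterSketch` and the `Collection` alias are hypothetical stand-ins written only for this illustration; they are not the podio `EventStore` or `ROOTWriter`, and "writing" here just records collection sizes instead of filling a tree.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

using Collection = std::vector<int>;  // stand-in for a real collection type

// Hypothetical store: a collection is created once and lives for the whole job.
class StoreSketch {
public:
  Collection& create(const std::string& name) {
    auto& slot = m_collections[name];
    if (!slot) slot = std::make_unique<Collection>();
    return *slot;
  }
  Collection* get(const std::string& name) {
    auto it = m_collections.find(name);
    return it == m_collections.end() ? nullptr : it->second.get();
  }
  // Empties every collection but keeps the objects (and their addresses) alive.
  void clearCollections() {
    for (auto& kv : m_collections) kv.second->clear();
  }

private:
  std::map<std::string, std::unique_ptr<Collection>> m_collections;
};

// Hypothetical writer: keeps a list of names that is fixed before the event loop.
class WriterSketch {
public:
  explicit WriterSketch(StoreSketch& store) : m_store(store) {}

  // Only names that the store already knows about are accepted.
  bool registerForWrite(const std::string& name) {
    if (!m_store.get(name)) return false;
    m_toWrite.push_back(name);
    return true;
  }
  // Stand-in for filling the event tree: record the size of each collection.
  void writeEvent() {
    std::vector<std::size_t> sizes;
    for (const auto& n : m_toWrite) sizes.push_back(m_store.get(n)->size());
    m_written.push_back(sizes);
  }
  const std::vector<std::vector<std::size_t>>& written() const { return m_written; }

private:
  StoreSketch& m_store;
  std::vector<std::string> m_toWrite;
  std::vector<std::vector<std::size_t>> m_written;
};
```

In an event loop built on this sketch, the collection reference obtained from `create` stays valid across events: each iteration fills the collection, calls `writeEvent()`, and then calls `clearCollections()`. A framework that instead constructs a fresh collection object every event has no such stable address to hand to the writer, which is the mismatch described in the first row of the table.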

In the end, to get everything working the same way and using the same facilities, some discussion is required to decide which functionality needs to be supported by podio, which functionality can be built on top of podio here, and, most importantly, what the interfaces have to look like to enable all this functionality.
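For reference in that discussion, the per-event selection that PodioOutput performs with its KeepDropSwitch can be approximated by the following sketch. `KeepDropSketch` and `selectForWrite` are hypothetical; the command syntax (`keep <pattern>` / `drop <pattern>`), the `*` wildcard, the "last matching command wins" rule and the keep-by-default behavior are assumptions made for this example and may differ from the real k4FWCore class.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified keep/drop selector.
class KeepDropSketch {
public:
  explicit KeepDropSketch(std::vector<std::string> commands)
      : m_commands(std::move(commands)) {}

  // Returns true if the collection with this name should be written.
  // Assumption: names are kept by default and later commands override earlier ones.
  bool isOn(const std::string& name) const {
    bool keep = true;
    for (const auto& cmd : m_commands) {
      const auto space = cmd.find(' ');
      if (space == std::string::npos) continue;  // ignore malformed commands
      const std::string verb = cmd.substr(0, space);
      const std::string pattern = cmd.substr(space + 1);
      if (verb != "keep" && verb != "drop") continue;
      if (matches(pattern, 0, name, 0)) keep = (verb == "keep");
    }
    return keep;
  }

private:
  // Recursive glob match where '*' stands for zero or more characters.
  static bool matches(const std::string& p, std::size_t i,
                      const std::string& s, std::size_t j) {
    if (i == p.size()) return j == s.size();
    if (p[i] == '*') {
      for (std::size_t k = j; k <= s.size(); ++k) {
        if (matches(p, i + 1, s, k)) return true;
      }
      return false;
    }
    return j < s.size() && p[i] == s[j] && matches(p, i + 1, s, j + 1);
  }

  std::vector<std::string> m_commands;
};

// Mimics the per-event step: filter the complete list of collection names.
std::vector<std::string> selectForWrite(const KeepDropSketch& sw,
                                        const std::vector<std::string>& all) {
  std::vector<std::string> out;
  for (const auto& n : all) {
    if (sw.isOn(n)) out.push_back(n);
  }
  return out;
}
```

For example, with the commands `drop *`, `keep MCParticles`, `keep Sim*`, a list containing `MCParticles`, `SimTrackerHits` and `RecoJets` is reduced to the first two names. Note that this filter runs on the complete list in every event, whereas a writer with a fixed registration list would evaluate it only once.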

@tmadlener tmadlener added the enhancement New feature or request label Nov 18, 2021
@vvolkl
Contributor

vvolkl commented Nov 19, 2021

Hi Thomas, thanks for this comprehensive issue. I think in the end it's simpler than the tables make it seem: the UserData functionality can completely replace what is used now for writing out vector etc. The only thing that I see missing on the podio side is something to allow the reader to ignore certain collections in the end store when writing as was done here with the KeepDropSwitch, but any implementation/interface for that is fine.

@tmadlener
Contributor Author

Hi Valentin,
Yes, I agree it probably looks worse than it actually is. I think the major problem for a straightforward migration is the difference in the ownership of the event data tree. It is not yet entirely clear to me how I can make the EventStore in the PodioDataSvc aware of the collections that are created during an event. My attempts so far have not succeeded, because collections are recreated every event, but podio currently foresees the creation/registration with the EventStore only once per collection.

> The only thing that I see missing on the podio side is something to allow the reader to ignore certain collections in the end store when writing as was done here with the KeepDropSwitch, but any implementation/interface for that is fine.

That could be achieved by only registering the collections that should be kept with the writer (registerForWrite has to be called once, before the first event is written, for every collection that should be written). In my first approach I simply did this when writing the first event. That seemed to work, except that I wasn't able to connect the branches to the collections properly.
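The "register when writing the first event" approach can be sketched roughly like this. `FirstEventRegistrar` and its `Selector` callback are hypothetical names for this illustration only; the sketch covers just the bookkeeping of names and deliberately leaves out the branch connection that did not work in the actual attempt.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: decides once, at the first event, which collections to write.
class FirstEventRegistrar {
public:
  using Selector = std::function<bool(const std::string&)>;

  explicit FirstEventRegistrar(Selector keep) : m_keep(std::move(keep)) {}

  // Called once per event with the names of the collections currently available.
  // The keep/drop decision is evaluated only on the first call; afterwards the
  // list is fixed, mirroring a one-time registration with the writer.
  const std::vector<std::string>& onEvent(const std::vector<std::string>& available) {
    if (!m_registered) {
      for (const auto& n : available) {
        if (m_keep(n)) m_toWrite.push_back(n);
      }
      m_registered = true;
    }
    return m_toWrite;
  }

private:
  Selector m_keep;
  bool m_registered = false;
  std::vector<std::string> m_toWrite;
};
```

One consequence of this design is visible directly in the sketch: a collection that first appears in a later event is never added to the list, so all collections that should be written need to exist in the first event.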

@tmadlener
Contributor Author

This should be done with #100
