-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: remove protocol for form transformation #382
Conversation
Codecov Report
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. @@ Coverage Diff @@
## main #382 +/- ##
==========================================
- Coverage 93.23% 93.21% -0.02%
==========================================
Files 24 24
Lines 3043 3038 -5
==========================================
- Hits 2837 2832 -5
Misses 206 206
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Makes sense to me, since this was @lgray's I'll ask him to give it a green light as well |
I think it's OK - but we need to find a place for it to live in the long run that's not coffea since it's not generally coffea specific (HEP, however, appears to be the only place where we need it presently). Having a factorization point between serialized data and user interface is pretty reasonable IMO, but you need to do some pretty advanced data transformations to require it. |
@lgray my view is that right now only coffea uses this. I don't see it as being a dask-awkward feature unless we extend the various ingest interfaces to support it. So right now I think if coffea want to generalise to other packages, perhaps we make a new |
One of the first things I'm going to do one we get CoffeaTeam/coffea#900 is to turn the parquet nanoevents back on and start testing it routinely, which will require a patch to dask-awkward. This also paves the way for future backends being easily ingested. I'm certainly happy with another package if that seems like a prudent long term choice, but given the first thing I'm going to do is make it work for parquet, I'm pushing a bit to make sure we're not being overly strict about this. |
Maybe the wrong place for this discussion, but i wonder when we create parquet files why we would even write them so "flat" like NanoAOD that we need to remap the form and not just write them in the structure we want it to have in the first place? Isn't it a nice feature of arrow/parquet that it seamlessly maps to awkward array? |
After Arrow version 2 or so, it has become very good at writing NanoEvents-type structures (e.g. with lists of particle-records, rather than separate jagged arrays for each particle attribute). Due to the way that Parquet stores data, NanoEvents-form and NanoAOD-form would use the same disk space, but if it's easier to just write and read Parquet without changing the form, that would be a natural thing to do. |
@lgray given how easy it is to publish new versions and packages, I am feeling that we don't need to move this to dask-awkward at this time. I wonder whether As such, are you content if we merge this PR, leave things in uproot for now, and subsequently handle this on the downstream side? |
Yes - been happy with proceeding from the start - this particular branch of conversation is not a show stopper. |
a2762a1
to
395879b
Compare
The test failures are due to Arrow being unavailable for Python 3.12. As the other tests pass, and we can see that error, I'm going to merge this one and get it off the queue. |
This should be done in uproot, I think.