feat: add computable report when graceful-read-failure is active in from_map (#415)
Codecov Report

```
@@            Coverage Diff            @@
##             main     #415    +/-   ##
=======================================
  Coverage   93.21%   93.21%
=======================================
  Files          23       23
  Lines        3168     3214    +46
=======================================
+ Hits         2953     2996    +43
- Misses        215      218     +3
```
```python
res = result.map_partitions(first, meta=array_meta, output_divisions=1)
rep = result.map_partitions(second, meta=empty_typetracer())
```
Here `res` is the actual interesting `Array` collection. The `array_meta` object is the meta we'd normally expect without going through the process of dual returns with the report. We can manually pass that meta object to `map_partitions` to override its default behavior of automatically determining a new meta; we also make sure to preserve divisions if they're known.

And `rep` is the new report `Array` collection; it doesn't matter what its meta or divisions are, since this should be a pretty small object once computed. If/when we converge on a proper schema for a report array, perhaps we can pass in its typetracer version for absolute correctness here.
Can we not specify `meta` and let dask-awkward figure it out using the graph?
I just gave the schema a little better definition; here it is in JSONSchema:

```python
{
    "type": "object",
    "properties": {
        "args": {"type": "array", "items": {"type": "string"}},
        "kwargs": {
            "type": "array",
            "items": {"type": "array", "items": {"type": "string"}},
        },
        "exception": {"type": "string"},
        "message": {"type": "string"},
    },
}
```

We report back the function arguments, keyword arguments, the exception name, and the message associated with it.
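As a sanity check, that schema can be exercised with a small, stdlib-only validator. This is a hedged sketch; `validate_report_record` is a hypothetical helper for illustration, not part of dask-awkward:

```python
def validate_report_record(record):
    """Check that a per-partition report record matches the schema above:
    string args, string-pair kwargs, and string exception/message fields."""
    def str_list(xs):
        return isinstance(xs, list) and all(isinstance(x, str) for x in xs)

    return (
        isinstance(record, dict)
        and str_list(record.get("args"))
        and isinstance(record.get("kwargs"), list)
        and all(str_list(pair) for pair in record["kwargs"])
        and isinstance(record.get("exception"), str)
        and isinstance(record.get("message"), str)
    )

# one successful and one failed partition, shaped per the schema
ok = {"args": [], "kwargs": [], "exception": "", "message": ""}
bad = {
    "args": ["/path/to/file2", "True", "2"],
    "kwargs": [["kwarg1", "foo"]],
    "exception": "RuntimeError",
    "message": "Failed to read.",
}
print(validate_report_record(ok), validate_report_record(bad))  # True True
```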
The report always has the same length as the number of partitions. So if we have five partitions and two failures, the computed report will end up looking something like this (making up some arguments and a single keyword argument):

```python
[
    {'args': [], 'kwargs': [], 'exception': '', 'message': ''},  # <-- succeeded
    {'args': [], 'kwargs': [], 'exception': '', 'message': ''},  # <-- succeeded
    {'args': ["/path/to/file2", "True", "2"], 'kwargs': [["kwarg1", "foo"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
    {'args': [], 'kwargs': [], 'exception': '', 'message': ''},  # <-- succeeded
    {'args': ["/path/to/file4", "True", "4"], 'kwargs': [["kwarg1", "bar"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
]
```
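Once such a report is computed (for example, converted to plain Python lists with `ak.to_list`), failed partitions can be picked out by checking for a non-empty `exception` field. A plain-Python sketch, assuming the default schema and the five-partition example above:

```python
# the computed report from the five-partition example above
report = [
    {"args": [], "kwargs": [], "exception": "", "message": ""},
    {"args": [], "kwargs": [], "exception": "", "message": ""},
    {"args": ["/path/to/file2", "True", "2"], "kwargs": [["kwarg1", "foo"]],
     "exception": "RuntimeError", "message": "Failed to read."},
    {"args": [], "kwargs": [], "exception": "", "message": ""},
    {"args": ["/path/to/file4", "True", "4"], "kwargs": [["kwarg1", "bar"]],
     "exception": "RuntimeError", "message": "Failed to read."},
]

# (partition index, record) for every partition whose read failed
failed = [(i, rec) for i, rec in enumerate(report) if rec["exception"]]
print([i for i, _ in failed])  # [2, 4]
```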
Oh - this is very good. I'll try to test it asap. If, instead of the partition number, it just reported the start and stop rows, it would be more immediately useful to users, I think. Otherwise there's some digging to do.
Will this scale OK when there are 200k partitions?
I imagine it would. I think a 200k-element array isn't too crazy, especially when most elements will be records without much data. An alternative approach to shrink the report would be to have the elements of the "report array" that represent successful reads just be None instead of the thin record. Something like:

```python
[
    None,
    None,
    {'args': ["/path/to/file2", "True", "2"], 'kwargs': [["kwarg1", "foo"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
    None,
    {'args': ["/path/to/file4", "True", "4"], 'kwargs': [["kwarg1", "bar"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
]
```

instead of what I wrote in my above comment.
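With the None-based variant, the failed partitions are simply the non-None entries; a plain-Python sketch of the idea (not dask-awkward API):

```python
# report with None standing in for successful partitions
report = [
    None,
    None,
    {"args": ["/path/to/file2", "True", "2"], "kwargs": [["kwarg1", "foo"]],
     "exception": "RuntimeError", "message": "Failed to read."},
    None,
    {"args": ["/path/to/file4", "True", "4"], "kwargs": [["kwarg1", "bar"]],
     "exception": "RuntimeError", "message": "Failed to read."},
]

# indices of failed partitions are just the non-None entries
failed_partitions = [i for i, rec in enumerate(report) if rec is not None]
print(failed_partitions)  # [2, 4]
```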
Seems like we may want to add the ability to have consumers of this API pass in a callback function that accepts the
Ah, I may have misunderstood; if the divisions are known, it would be possible for dask-awkward to combine the known divisions with the failed partitions.
Yeah, I like the idea of being able to control the form of what gets returned, with some decent presets. You understood my question correctly the first time! If there are (somehow) 200k file-read errors, are we going to cause further errors or hassle for the user? :-D Seems like not so much, but I wanted to double check with you.
Great, I'm working on making what gets reported customizable, with what I think is a sane default.
Oh I see what you mean. Yeah I mean for the return:
something like:
Would be more immediately useful.
Ah, in my little example that was just some made-up set of arguments getting passed to the actual read function. So in the uproot case it would be the arguments passed to the callable class `_UprootOpenAndRead`, for example. I've just pushed a commit that makes the report construction customizable, but with the default that we've already been discussing. To customize, the consumer of this API needs to define two functions: a success function and a failure function. The success function just takes
For reporting the rows, I think one would just need to combine the knowledge of the divisions with the computed report:

```python
def find_bad_rows(report, divisions):
    bad_rows = []
    for i, element in enumerate(report):
        # an element of the report is "bad" when (e.g.) the "exception"
        # field of the default schema is a non-empty string
        if element.exception:
            bad_rows.append((divisions[i], divisions[i + 1]))
    return bad_rows

report_collection, array_collection = uproot.dask(..., report=True)
report, array = dask.compute(report_collection, array_collection)
bad_rows = find_bad_rows(report, array_collection.divisions)
```

Or, if the
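A runnable, stdlib-only version of the same `find_bad_rows` idea, with dict records standing in for awkward records and made-up divisions for five partitions of 100 rows each:

```python
def find_bad_rows(report, divisions):
    """Map failed partitions to (start, stop) row ranges via the divisions."""
    bad_rows = []
    for i, element in enumerate(report):
        # the "exception" field is a non-empty string for failed reads
        if element["exception"]:
            bad_rows.append((divisions[i], divisions[i + 1]))
    return bad_rows

# hypothetical divisions: five partitions of 100 rows each
divisions = (0, 100, 200, 300, 400, 500)
report = [
    {"exception": ""},
    {"exception": ""},
    {"exception": "RuntimeError"},
    {"exception": ""},
    {"exception": "RuntimeError"},
]
print(find_bad_rows(report, divisions))  # [(200, 300), (400, 500)]
```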
Just kind of a cosmetic thing, but can we have the return be:

```python
array, report = uproot.dask(...)
```

That would match typetracer with report, etc., as far as consistency goes.
Of course! This was just some shorthand -- I'm sure the PR to uproot will be more thought out :)
Solved
@lgray this should be ready to get put to the fire by an implementation in uproot. I'll start drawing something up.
awesome! A parquet-based one would be cool too but is not as high-priority. |
```diff
 def io_func_implements_projection(func: ImplementsIOFunction) -> bool:
-    return hasattr(func, "prepare_for_projection")
+    return hasattr(maybe_unwrap(func), "prepare_for_projection")
```
I'd be tempted to remove the unwrapping here, and require that the caller unwrap the IO function. That way it's a bit more explicit?
@douglasdavis looks good! I made some comments. In general, do you think it's possible / sensible to move the

Hopefully this review doesn't feel like a lot of nitpicks and last-minute suggestions. I'm glad you've thought about this, because it's important to get right, and I'm just adding some without-much-context thoughts.
@agoose77 thanks a lot for the comments.
I have thought a bit about this, especially with the
Not at all!
The more I think about it, the more I like smooshing most of the logic into the io_func classes. I'll see if I can refactor this into doing that while keeping the
This changes the `from_map` interface to return two collections when `empty_on_raise` and `empty_backend` are both not `None`. (When those arguments are both `None` we just return a single `Array`, which has been the existing behavior until this PR; hopefully the `@overload`s show this in a clear way.)

- The first `Array` collection is what we've always had: the lazy awkward array populated by data on disk.
- The second `Array` is a record array, one element per partition, that represents a report about failed reads.

After banging my head thinking of ways to handle reports inside the graph, I realized it's possible for us to just make the report an awkward array (instead of some new object, the rabbit hole I originally went down). I still need to check if this breaks any optimizations, but my intuition is that it shouldn't. I still wouldn't bet the farm :)

cc @lgray @agoose77 for your thoughts. This PR demonstrates the possibility, and I think it fits in nicely with the existing `uproot.iterate(..., report=True)` two-item return pattern.