feat: add computable report when graceful-read-failure is active in from_map (#415)
Codecov Report

```
@@            Coverage Diff            @@
##             main     #415    +/-   ##
=======================================
  Coverage   93.21%   93.21%
=======================================
  Files          23       23
  Lines        3168     3214    +46
=======================================
+ Hits         2953     2996    +43
- Misses        215      218     +3
```
```python
res = result.map_partitions(first, meta=array_meta, output_divisions=1)
rep = result.map_partitions(second, meta=empty_typetracer())
```
Here `res` is the actual interesting `Array` collection. The `array_meta` object is the meta we'd normally expect without going through the process of dual returns with the report. We can manually pass that meta object to `map_partitions` to override its default behavior of automatically determining a new meta; we also make sure to preserve divisions if they're known.

And `rep` is the new report `Array` collection; it doesn't matter what its meta or divisions are, since this should be a pretty small object once computed. If/when we converge on a proper schema for a report array, perhaps we can pass in its typetracer version for absolute correctness here.
Can we not specify `meta` and let dask-awkward figure it out using the graph?
I just gave the schema a little better definition; here it is in JSONSchema:

```python
{
    "type": "object",
    "properties": {
        "args": {"type": "array", "items": {"type": "string"}},
        "kwargs": {
            "type": "array",
            "items": {"type": "array", "items": {"type": "string"}},
        },
        "exception": {"type": "string"},
        "message": {"type": "string"},
    },
}
```

We report back the function arguments, keyword arguments, the exception name, and the message associated with it.
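As a sanity check, that schema can be exercised with a small, stdlib-only validator. This is a hedged sketch; `validate_report_record` is a hypothetical helper for illustration, not part of dask-awkward:

```python
def validate_report_record(record):
    """Check that a per-partition report record matches the schema above:
    string args, string-pair kwargs, and string exception/message fields."""
    def str_list(xs):
        return isinstance(xs, list) and all(isinstance(x, str) for x in xs)

    return (
        isinstance(record, dict)
        and str_list(record.get("args"))
        and isinstance(record.get("kwargs"), list)
        and all(str_list(pair) for pair in record["kwargs"])
        and isinstance(record.get("exception"), str)
        and isinstance(record.get("message"), str)
    )

# one successful and one failed partition, shaped per the schema
ok = {"args": [], "kwargs": [], "exception": "", "message": ""}
bad = {
    "args": ["/path/to/file2", "True", "2"],
    "kwargs": [["kwarg1", "foo"]],
    "exception": "RuntimeError",
    "message": "Failed to read.",
}
print(validate_report_record(ok), validate_report_record(bad))  # True True
```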
The report always has the same length as the number of partitions. So if we have five partitions and two failures, the computed report will end up looking something like this (making up some arguments and a single keyword argument):

```python
[
    {'args': [], 'kwargs': [], 'exception': '', 'message': ''},  # <-- succeeded
    {'args': [], 'kwargs': [], 'exception': '', 'message': ''},  # <-- succeeded
    {'args': ["/path/to/file2", "True", "2"], 'kwargs': [["kwarg1", "foo"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
    {'args': [], 'kwargs': [], 'exception': '', 'message': ''},  # <-- succeeded
    {'args': ["/path/to/file4", "True", "4"], 'kwargs': [["kwarg1", "bar"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
]
```
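Once such a report is computed (for example, converted to plain Python lists with `ak.to_list`), failed partitions can be picked out by checking for a non-empty `exception` field. A plain-Python sketch, assuming the default schema and the five-partition example above:

```python
# the computed report from the five-partition example above
report = [
    {"args": [], "kwargs": [], "exception": "", "message": ""},
    {"args": [], "kwargs": [], "exception": "", "message": ""},
    {"args": ["/path/to/file2", "True", "2"], "kwargs": [["kwarg1", "foo"]],
     "exception": "RuntimeError", "message": "Failed to read."},
    {"args": [], "kwargs": [], "exception": "", "message": ""},
    {"args": ["/path/to/file4", "True", "4"], "kwargs": [["kwarg1", "bar"]],
     "exception": "RuntimeError", "message": "Failed to read."},
]

# (partition index, record) for every partition whose read failed
failed = [(i, rec) for i, rec in enumerate(report) if rec["exception"]]
print([i for i, _ in failed])  # [2, 4]
```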
Oh - this is very good. I'll try to test it asap. If, instead of the partition number, it just reported the start and stop rows, it would be more immediately useful to users, I think. Otherwise there's some digging to do.
Will this scale OK when there are 200k partitions?
I imagine it would. I think a 200k-element array isn't too crazy, especially when most elements will be records without much data. An alternative approach to shrink the report would be to have the elements of the "report array" that represent successful reads just be None instead of the thin record. Something like:

```python
[
    None,
    None,
    {'args': ["/path/to/file2", "True", "2"], 'kwargs': [["kwarg1", "foo"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
    None,
    {'args': ["/path/to/file4", "True", "4"], 'kwargs': [["kwarg1", "bar"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
]
```

instead of what I wrote in my above comment.
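With the None-based variant, the failed partitions are simply the non-None entries; a plain-Python sketch of the idea (not dask-awkward API):

```python
# report with None standing in for successful partitions
report = [
    None,
    None,
    {"args": ["/path/to/file2", "True", "2"], "kwargs": [["kwarg1", "foo"]],
     "exception": "RuntimeError", "message": "Failed to read."},
    None,
    {"args": ["/path/to/file4", "True", "4"], "kwargs": [["kwarg1", "bar"]],
     "exception": "RuntimeError", "message": "Failed to read."},
]

# indices of failed partitions are just the non-None entries
failed_partitions = [i for i, rec in enumerate(report) if rec is not None]
print(failed_partitions)  # [2, 4]
```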
Seems like we may want to add the ability to have consumers of this API pass in a callback function that accepts the
Ah, I may have misunderstood; if the divisions are known, it would be possible for dask-awkward to combine the known divisions with the failed partitions.
Yeah, I like the idea of being able to control the form of what gets returned, with some decent presets. You understood my question correctly the first time! If there are (somehow) 200k file-read errors, are we going to cause further errors or hassle for the user? :-D Seems like not so much, but I wanted to double check with you.
Great, I'm working on making what gets reported customizable, with what I think is a sane default.
Oh I see what you mean. Yeah I mean for the return:
something like:
Would be more immediately useful.
Ah, in my little example that was just some made-up set of arguments getting passed to the actual read function. So in the uproot case it would be the arguments passed to the callable class `_UprootOpenAndRead`, for example. I've just pushed a commit that makes the report construction customizable, but with the default that we've already been discussing. To customize, the consumer of this API needs to define two functions: a success function and a failure function. The success function just takes
For reporting the rows, I think one would just need to combine the knowledge of the divisions with the computed report:

```python
def find_bad_rows(report, divisions):
    bad_rows = []
    for i, element in enumerate(report):
        # an element of the report is "bad" when (e.g.) the "exception"
        # field of the default schema is a non-empty string
        if element.exception:
            bad_rows.append((divisions[i], divisions[i + 1]))
    return bad_rows

report_collection, array_collection = uproot.dask(..., report=True)
report, array = dask.compute(report_collection, array_collection)
bad_rows = find_bad_rows(report, array_collection.divisions)
```

Or, if the
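A runnable, stdlib-only version of the same `find_bad_rows` idea, with dict records standing in for awkward records and made-up divisions for five partitions of 100 rows each:

```python
def find_bad_rows(report, divisions):
    """Map failed partitions to (start, stop) row ranges via the divisions."""
    bad_rows = []
    for i, element in enumerate(report):
        # the "exception" field is a non-empty string for failed reads
        if element["exception"]:
            bad_rows.append((divisions[i], divisions[i + 1]))
    return bad_rows

# hypothetical divisions: five partitions of 100 rows each
divisions = (0, 100, 200, 300, 400, 500)
report = [
    {"exception": ""},
    {"exception": ""},
    {"exception": "RuntimeError"},
    {"exception": ""},
    {"exception": "RuntimeError"},
]
print(find_bad_rows(report, divisions))  # [(200, 300), (400, 500)]
```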
Just kind of a cosmetic thing, but can we have the return be:

```python
array, report = uproot.dask(...)
```

That would match typetracer with report, etc., as far as consistency goes.
Of course! This was just some shorthand -- I'm sure the PR to uproot will be more thought out :)
Solved
@lgray this should be ready to get put to the fire by an implementation in uproot. I'll start drawing something up.
awesome! A parquet-based one would be cool too but is not as high-priority. |
```diff
 def io_func_implements_projection(func: ImplementsIOFunction) -> bool:
-    return hasattr(func, "prepare_for_projection")
+    return hasattr(maybe_unwrap(func), "prepare_for_projection")
```
I'd be tempted to remove the unwrapping here, and require that the caller unwrap the IO function. That way it's a bit more explicit?
@douglasdavis looks good! I made some comments. In general, do you think it's possible / sensible to move the

Hopefully this review doesn't feel like a lot of nitpicks and last-minute suggestions. I'm glad you've thought about this, because it's important to get right, and I'm just adding some without-much-context thoughts.
@agoose77 thanks a lot for the comments.
I have thought a bit about this, especially with the
Not at all!
The more I think about it, the more I like smooshing most of the logic into the io_func classes. I'll see if I can refactor this into doing that while keeping the
This changes the `from_map` interface to return two collections when `empty_on_raise` and `empty_backend` are both not `None`. (When those arguments are both `None` we just return a single `Array`, which has been the existing behavior until this PR; hopefully the `@overload`s show this in a clear way.)

- The first `Array` collection is what we've always had: the lazy awkward array populated by data on disk.
- The second `Array` is a record array, one element per partition, that represents a report about failed reads.

After banging my head thinking of ways to handle reports inside the graph, I realized it's possible for us to just make the report an awkward array (instead of some new object, the rabbit hole I originally went down). I still need to check if this breaks any optimizations, but my intuition is that it shouldn't. I still wouldn't bet the farm :)

cc @lgray @agoose77 for your thoughts. This PR demonstrates the possibility, and I think it fits in nicely with the existing `uproot.iterate(..., report=True)` two-item return pattern.