Describe the enhancement requested

Hi,
Somewhat new to Arrow - I've used the basics briefly before and am aware it underpins many other tools I have used, but so far I've only needed to get my feet a little wet in the actual API.

I was taking some data from a file, constructing a filter on it, and then wanted to transform one of the columns by calling a function with the value of each row. I know there are Compute functions, and I have also learned a bit about Gandiva - but I'm somewhat surprised that after a few hours of googling I don't seem to have found a straightforward way of applying my own callable (i.e. a std::function, or similar) in the filtering pipeline, short of a relatively lengthy process of registering it with the compute function registry and writing a lot of boilerplate. Maybe this exists and I could be pointed in the right direction, but thus far I haven't seen anything indicating it is currently possible. From what I can gather, the R and Python packages do allow something like a lambda (albeit in the local language) in their filter pipelines, but for C++ I'm not aware of a way to do this.

In my case, taking the example code for filtering:
// ... Open a dataset here.

// Read specified columns with a row filter
ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
ARROW_RETURN_NOT_OK(scan_builder->Project({"b"}));
ARROW_RETURN_NOT_OK(scan_builder->Filter(
    cp::less(cp::field_ref("b"), cp::literal(4))));
ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
return scanner->ToTable();
I'd like to be able to pass in some sort of native C++ callable, with relatively few lines, and have this called whilst iterating over the data:
// ... Open a dataset here.

// Read specified columns with a row filter
ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
ARROW_RETURN_NOT_OK(scan_builder->Project({"b"}));
ARROW_RETURN_NOT_OK(scan_builder->Filter(
    cp::makeFunction([&]<typename Scalar>(const Scalar& cellValue) {
      return someComplicatedObject->someComplicatedFunction("hello", "world", cellValue);
    })));
ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
return scanner->ToTable();
as just one hypothetical example of what this might look like.
In my case the function is stateful - it's not a pure function, so it couldn't be created through a combination of primitives in the compute library, and whilst Gandiva is possible, I don't think it will work that easily, as exposing a C API would make things rather ugly. Also, for this use case, whilst I like the idea of making a kernel through IR lowering, it's overkill for what I need, and I'd happily forego it in this scenario just for the ease of giving Arrow a native callable and not having to write a lot of boilerplate myself.
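For context, this is roughly the ceremony I believe is currently required to go through the registry (a rough sketch based on my reading of the compute docs; the exact kernel and constructor signatures have changed between Arrow versions, so treat the details as approximate):

#include <arrow/api.h>
#include <arrow/compute/api.h>

namespace cp = arrow::compute;

// The per-batch kernel body. This signature matches recent releases; older
// ones used (KernelContext*, const ExecBatch&, Datum*) instead.
arrow::Status MyFilterExec(cp::KernelContext* ctx, const cp::ExecSpan& batch,
                           cp::ExecResult* out) {
  // ... loop over batch[0], call the stateful object, emit booleans ...
  return arrow::Status::OK();
}

arrow::Status RegisterMyFilter() {
  auto func = std::make_shared<cp::ScalarFunction>(
      "my_filter", cp::Arity::Unary(),
      cp::FunctionDoc("My filter", "Stateful row predicate", {"value"}));
  // One kernel per concrete input type I want to support.
  ARROW_RETURN_NOT_OK(func->AddKernel({arrow::utf8()}, arrow::boolean(), MyFilterExec));
  return cp::GetFunctionRegistry()->AddFunction(std::move(func));
}

// ...and only after all that:
// scan_builder->Filter(cp::call("my_filter", {cp::field_ref("b")}));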
I expect managing parallelism, as well as the way types are handled, would be a significant sticking point here - I don't yet know much about how Arrow handles them: coercion, higher-level concepts like "numeric" as opposed to concrete types like doubles or size_ts, etc. But also on this point, I think:
- the template system has the capacity to generate a lot of boilerplate if needed, so it could be that stuff still needs to happen to register a function, but it can be generated for me (see the sketch after this list),
- or I could specify some kind of tuple that lists the Arrow datatypes the function could support, and Arrow would perform the checks and then turn around and call the function with the decltype of the callable,
- I'd be happy with a runtime exception - or even a segfault, to be honest - if something goes wrong between the dataset's schema and my function, and
- as it's being used in a specific scenario and I'm specifying a function directly, I don't believe it needs to be that universal.
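To sketch the first two points concretely (entirely hypothetical - none of these names exist in Arrow today), the makeFunction helper from my example above could take the supported Arrow types as template parameters and generate the registration for each:

namespace cp = arrow::compute;

// Hypothetical: generates and registers one kernel per listed Arrow type,
// deducing the C++ argument type from the callable itself.
template <typename... SupportedArrowTypes, typename Callable>
cp::Expression makeFunction(Callable&& fn);

// Imagined usage: only strings are declared; any other column type would
// simply error out (or blow up) at runtime, which I'd accept.
auto expr = makeFunction<arrow::StringType>(
    [&](std::string_view cell) { return someComplicatedObject->check(cell); });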
For example, because this is quite specific to the data I'm operating on - its datatypes and structure - my gut feeling is it would be fine for the function to work in more limited scenarios: whilst the onus of managing the interface could be placed on Arrow, the onus of specifying the datatypes correctly, or carving out the scenario under which this is well-formed, can justifiably fall to the user. I think that matches how a lambda or custom function gets used - if it's not generic but rather specific or custom logic the user needs to perform, then it's less of an issue if it's somewhat tied to how datatypes are handled and would error out or blow up if the actual data that comes in isn't what was specified. If my function only works on strings, it'd be an existing issue if the column I tried to call it on didn't contain strings, or were missing; so in my mind, limiting the scope and either assuming the data will be a string, or only supporting a string, is preferable to having to register a custom function with a registry and specify one or several different ways in which it could work, which would be overkill for my use case.
Similarly on parallelism, if my function is thread-safe, then I'd happily specify that myself, and if not, indicate that the data will need to be iterated on one core at most for this part of the filter.
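As a sketch of what declaring that might look like (again, invented names):

// Hypothetical: tell Arrow this callable holds state and must not be
// invoked from multiple threads concurrently.
auto expr = makeFunction<arrow::StringType>(
    myCallable, cp::CallableOptions{/*thread_safe=*/false});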
My current way of doing this is: filter some columns I don't need out of the dataset, iterate over every row in batches and build an array of boolean flags indicating whether we keep each row, create a new table from new record batches with that column appended, filter it again to remove the discarded rows, and finally project to remove that column as well. This is a row filter, but I think the same idea holds for something like a map - say we took a numerical column with the brightness of an observed star and wanted to run some non-trivial calculation to estimate its mass or how far it might be from Earth. The dataset filtering code in the example is really nice and terse, and also quite readable, and I let Arrow work out the actual implementation and just declare what I want the data pipeline to do. What I have is a lot more verbose and isn't ideal from that perspective. And I'm not using larger-than-memory data, so it's not an issue for me, but I expect it could be - or at least would have to be planned more carefully - for performing a filter on enormous datasets.
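For reference, a condensed sketch of that workaround (assuming a string column "b", a stateful keepRow callable, and a recent Arrow where StringArray::GetView returns std::string_view; I've handed the mask to cp::Filter directly rather than appending it as a column, but the shape is the same):

#include <arrow/api.h>
#include <arrow/compute/api.h>

arrow::Result<std::shared_ptr<arrow::Table>> FilterWithCallable(
    const std::shared_ptr<arrow::Table>& table,
    const std::function<bool(std::string_view)>& keepRow) {
  arrow::BooleanBuilder mask_builder;
  // Walk the column chunk by chunk, building the keep/discard mask.
  // Null cells are treated as "discard" here.
  for (const auto& chunk : table->GetColumnByName("b")->chunks()) {
    auto strings = std::static_pointer_cast<arrow::StringArray>(chunk);
    for (int64_t i = 0; i < strings->length(); ++i) {
      ARROW_RETURN_NOT_OK(mask_builder.Append(
          strings->IsValid(i) && keepRow(strings->GetView(i))));
    }
  }
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> mask,
                        mask_builder.Finish());
  // Let the built-in "filter" kernel drop the discarded rows.
  ARROW_ASSIGN_OR_RAISE(arrow::Datum filtered,
                        arrow::compute::Filter(table, mask));
  return filtered.table();
}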
So, I don't know if this is possible, but if there were a way of managing it, it'd be a very nice interface to have in the C++ APIs, and I think it could potentially save a lot of time as well as add flexibility for a user-programmer writing data pipelines.
Thanks!
Component(s)
C++