ak.records_to_regular to convert [{"x": 1, "y": 2}, {"x": 3, "y": 4}] into [[1, 2], [3, 4]] #3257

Open
jpivarski opened this issue Sep 25, 2024 · 2 comments
Labels
feature New feature or request

Comments

@jpivarski
Member

Description of new feature

Awkward Array's idiomatic form for data points with named features is to use RecordArray, which keeps each record field in a separate array (useful for loading or working with a subset of columns).

Machine learning libraries like to see a feature-set (an input vector into a neural network) as a regular dimension, either RegularArray or NumpyArray with inner_shape != () (which become the same thing after conversion out of Awkward). Unlike a RecordArray, the different features of the same vector are contiguous in memory.

Also unlike a RecordArray, the elements of a feature vector have no names. I do not know if there's a way to preserve these feature names, in PyTorch for instance, but it would be nice to do so in a conversion from Awkward Arrays into PyTorch Tensors.

For an array in which the records are one level deep, such as

>>> array = ak.Array([[{"pt": 0.0, "eta": 1.1}, {"pt": 2.2, "eta": 3.3}], [], [{"pt": 4.4, "eta": 5.5}]])

can be implemented as

>>> ak.unflatten(ak.concatenate(ak.unzip(array), axis=1), 2, axis=1)
<Array [[[0, 2.2], [1.1, 3.3]], ..., [[4.4, ...]]] type='3 * var * 2 * float64'>

but we're interested in a function that can be applied regardless of how deep the first level of records is. It would be written with recursively_apply. At some level of recursively_apply, you'd have passed through the list-type node and would be seeing the RecordArray directly:

>>> array = ak.Array([{"pt": 0.0, "eta": 1.1}, {"pt": 2.2, "eta": 3.3}, {"pt": 4.4, "eta": 5.5}])

and then you'd want to do something like

>>> ak.concatenate([x[:, np.newaxis] for x in ak.unzip(array)], axis=1)
<Array [[0, 1.1], [2.2, 3.3], [4.4, 5.5]] type='3 * 2 * float64'>

(This preserves the length, 3, so it is suitable for use inside recursively_apply.)

This function would be useful for Awkward → ML conversions regardless of whether the data are ragged or not.

If RecordArrays are nested within one another, this function can be applied multiple times, turning one level of record-type into a dimension on each application.
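The intended results, including repeated application to nested records, can be pinned down with a pure-Python reference that works on plain lists and dicts. This is only a sketch of the expected semantics, not an Awkward implementation, and the name `records_to_regular_reference` is hypothetical:

```python
def records_to_regular_reference(data):
    """Replace the first level of records (dicts) with lists of field values."""
    if isinstance(data, dict):
        # A record becomes a list of its field values in field order.
        # Field names are dropped, and nested records are left untouched.
        return list(data.values())
    if isinstance(data, list):
        # Pass through list levels until a record is reached.
        return [records_to_regular_reference(item) for item in data]
    return data


ragged = [[{"pt": 0.0, "eta": 1.1}, {"pt": 2.2, "eta": 3.3}], [], [{"pt": 4.4, "eta": 5.5}]]
flat = [{"pt": 0.0, "eta": 1.1}, {"pt": 2.2, "eta": 3.3}, {"pt": 4.4, "eta": 5.5}]
nested = [{"p": {"x": 1, "y": 2}, "q": {"x": 3, "y": 4}}]

once = records_to_regular_reference(nested)
twice = records_to_regular_reference(once)
```

The same call handles `ragged` and `flat` without knowing how deep the records sit, and a second call on `once` converts the inner records.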

@jpivarski
Member Author

There's a NumPy function like this called np.lib.recfunctions.structured_to_unstructured.
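For comparison, a short sketch of the NumPy function on the same three data points. The field names and values are taken from the examples above, and the variable names are only illustrative:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# A structured array: one named field per feature.
structured = np.array(
    [(0.0, 1.1), (2.2, 3.3), (4.4, 5.5)],
    dtype=[("pt", "f8"), ("eta", "f8")],
)

# The fields become the last, regular dimension of an ordinary array.
unstructured = rfn.structured_to_unstructured(structured)

# The feature names are not carried by the result; they can be kept separately.
feature_names = structured.dtype.names
```

As with the proposed Awkward function, the names are lost in the conversion and only the order of the fields is preserved.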
