[Python] FixedShapeTensorArray.to_numpy_ndarray() fails with numpy arrays of type string in Pyarrow-16, but works with v15 #43614
Comments
I wasn't aware As a workaround you could change
def _extract_element_feather(column: Any):
    import pyarrow as pa
    if type(column) == pa.lib.FixedShapeTensorArray:
        return column.storage.to_numpy(zero_copy_only=False)
    else:
        return column.as_py()
When |
In theory nothing in the spec says that you can only use it for numerical data types (although that is of course the typical use case): arrow/docs/source/format/CanonicalExtensions.rst Lines 87 to 89 in 3420c0d
And given you can construct an extension array from the storage, you can indeed easily construct a FixedShapeTensorArray with any Arrow type:
>>> storage_arr = pa.array([["a", "b"], ["c", "d"], ["e", "f"]], pa.list_(pa.string(), 2))
>>> arr = pa.ExtensionArray.from_storage(pa.fixed_shape_tensor(pa.string(), (2, )), storage_arr)
>>> arr
<pyarrow.lib.FixedShapeTensorArray object at 0x7f4d12e41da0>
[
  [
    "a",
    "b"
  ],
  [
    "c",
    "d"
  ],
  [
    "e",
    "f"
  ]
]
Of course, just because this is supported in general, doesn't necessarily require that we support all data types in
From a user point of view, I don't directly see a reason not to support this conversion. The practical reason it no longer works is an implementation change in #37533, where we moved the conversion to a numpy array to the C++ level by first converting to a Tensor (i.e. the |
Yes, but we can't really express string tensor with
That's a good point. We can reintroduce the old path for types that can't be converted into Tensors (I suppose that'll only be string anyway). |
What do you mean with "express"? The "fixed shape" vs "variable shape" in the type name is about the number of values in the tensor elements, and not about whether those individual values are stored in a fixed width storage layout or not. |
Ah, sorry, for some reason I thought What would then be the best way to enable
|
Just FYI, we did not intend to store strings that way, and we have changed our code to no longer do that. As far as I am concerned, I am totally fine with arrow refusing such constructs for writing. But for reading, given that it was possible before, having the flexibility would be nice. |
Thanks for the context @bschindler and for letting us know about this! @jorisvandenbossche do you have a preference on tackling this? If not I'd lean towards 3 for now and can open a PR. |
Describe the bug, including details regarding any error messages, version, and platform.
FixedShapeTensorArray.to_numpy_ndarray() fails with numpy arrays of type string
with pyarrow 16, whereas pyarrow 15 works. As a result, we are unable to load our feather files that were created with older pyarrow versions.
While we should not have stored those strings this way, we now have a significant library of files that contain such constructs.
Code to load the file:
Test file: out.feather.zip
Pyarrow 16:
Pyarrow 15:
Component(s)
Python