Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] FixedShapeTensorArray.to_numpy_ndarray() fails with numpy arrays of type string in Pyarrow-16, but works with v15 #43614

Open
bschindler opened this issue Aug 8, 2024 · 7 comments

Comments

@bschindler
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

FixedShapeTensorArray.to_numpy_ndarray() fails with numpy arrays of type string

with pyarrow 16, whereas pyarrow 15 works. As a result, we are unable to load our feather files that were created with older pyarrow versions.
While we should not have stored those strings this way, we have now a significant library of files that contain such constructs.
Code to load the file:

from pathlib import Path
from typing import Any

import pyarrow as pa


def _extract_element_feather(column: Any):
    import pyarrow as pa

    if type(column) == pa.lib.FixedShapeTensorArray:
        return column.to_numpy_ndarray()
    else:
        return column.as_py()

with pa.ipc.open_file(Path("out.feather")) as reader:
    table = reader.get_batch(0)
    features = {
        k.name: _extract_element_feather(table[k.name])
        for k in table.schema
    }
 
print(features)

Test file: out.feather.zip

Pyarrow 16:

  File "/Users/bschindler/Library/Application Support/JetBrains/PyCharm2024.1/scratches/scratch_41.py", line 17, in <module>
    features = {
  File "/Users/bschindler/Library/Application Support/JetBrains/PyCharm2024.1/scratches/scratch_41.py", line 18, in <dictcomp>
    k.name: _extract_element_feather(table[k.na
me])
  File "/Users/bschindler/Library/Application Support/JetBrains/PyCharm2024.1/scratches/scratch_41.py", line 11, in _extract_element_feather
    return column.to_numpy_ndarray()

  File "pyarrow/array.pxi", line 4240, in pyarrow.lib.FixedShapeTensorArray.to_numpy_ndarray
  File "pyarrow/array.pxi", line 4265, in pyarrow.lib.FixedShapeTensorArray.to_tensor
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
  pyarrow.lib.ArrowTypeError: string is not valid data type for a tensor

Pyarrow 15:

 {'data': array([['Hello World']], dtype=object)}

Component(s)

Python

@rok
Copy link
Member

rok commented Aug 8, 2024

I wasn't aware FixedShapeTensorArray could be constructed with strings, are you sure your data round trips unchanged?

As a workaround you could change _extract_element_feather to this:

def _extract_element_feather(column: Any):
    import pyarrow as pa

    if type(column) == pa.lib.FixedShapeTensorArray:
        return column.storage.to_numpy(zero_copy_only=False)
    else:
        return column.as_py()

When FixedShapeTensorArray was built we didn't expect it'd be used for strings. However if there's need for it we can have a discussion and see if we could accommodate the usecase.

@jorisvandenbossche jorisvandenbossche changed the title FixedShapeTensorArray.to_numpy_ndarray() fails with numpy arrays of type string in Pyarrow-16, but works with v15 [Python] FixedShapeTensorArray.to_numpy_ndarray() fails with numpy arrays of type string in Pyarrow-16, but works with v15 Aug 8, 2024
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 8, 2024

I wasn't aware FixedShapeTensorArray could be constructed with strings ..
When FixedShapeTensorArray was built we didn't expect it'd be used for strings.

In theory nothing in the spec says that you can only use it for numerical data types (although that is of course the typical use case):

* Extension type parameters:
* **value_type** = the Arrow data type of individual tensor elements.

And given you can construct an extension array from the storage, you can indeed easily construct a FixedShapeTensorArray with any Arrow type:

>>> storage_arr = pa.array([["a", "b"], ["c", "d"], ["e", "f"]], pa.list_(pa.string(), 2))
>>> arr = pa.ExtensionArray.from_storage(pa.fixed_shape_tensor(pa.string(), (2, )), storage_arr)
>>> arr
<pyarrow.lib.FixedShapeTensorArray object at 0x7f4d12e41da0>
[
  [
    "a",
    "b"
  ],
  [
    "c",
    "d"
  ],
  [
    "e",
    "f"
  ]
]

Of course, just because this is supported in general, doesn't necessarily require that we support all data types in to_numpy_ndarray() specifically. But so this worked in 15.0, and no longer works now:

>>> arr.to_numpy_ndarray()
array([['a', 'b'],
       ['c', 'd'],
       ['e', 'f']], dtype=object)

From a user point of view, I don't directly see a reason to not support this conversion. The practical reason it no longer works is because of an implementation change in #37533 where we moved this conversion to a numpy array to the C++ level through first converting to a Tensor (i.e. the to_numpy_ndarray() method now just calls self.to_tensor().to_numpy()), and we do only support numerical data types for a Tensor.

@rok
Copy link
Member

rok commented Aug 8, 2024

In theory nothing in the spec says that you can only use it for numerical data types (although that is of course the typical use case)

Yes, but we can't really express string tensor with FixedShapeTensorArray. Perhaps we could with VariableShapeTensorArray where the last dimension is length of the string? :)

From a user point of view, I don't directly see a reason to not support this conversion. The practical reason it no longer works is because of an implementation change in #37533 where we moved this conversion to a numpy array to the C++ level through first converting to a Tensor (i.e. the to_numpy_ndarray() method now just calls self.to_tensor().to_numpy()), and we do only support numerical data types for a Tensor.

That's a good point. We can reintroduce the old path for types that can't be converted into Tensors (I suppose that'll only string anyway).

@jorisvandenbossche
Copy link
Member

Yes, but we can't really express string tensor with FixedShapeTensorArray. Perhaps we could with VariableShapeTensorArray where the last dimension is length of the string? :)

What do you mean with "express"? The "fixed shape" vs "variable shape" in the type name is about the number of values in the tensor elements, and not about whether those individual values are stored in a fixed width storage layout or not.
(of course, for conversion to tensor libraries like numpy or pytorch, this matters, but not for Arrow itself)

@rok
Copy link
Member

rok commented Aug 8, 2024

Yes, but we can't really express string tensor with FixedShapeTensorArray. Perhaps we could with VariableShapeTensorArray where the last dimension is length of the string? :)

What do you mean with "express"? The "fixed shape" vs "variable shape" in the type name is about the number of values in the tensor elements, and not about whether those individual values are stored in a fixed width storage layout or not. (of course, for conversion to tensor libraries like numpy or pytorch, this matters, but not for Arrow itself)

Ah, sorry, for some reason I thought FixedSizeList was also fixed-width, which it is not if we're storing strings.
Then FixedShapeTensorArray should be just fine for this case indeed.

What would then be the best way to enable to_numpy_ndarray?:

  1. variable bit-width Tensor to use the current codepath
  2. add new c++ method for this exact case
  3. reuse the python code path from 15.0.0

@bschindler
Copy link
Author

Just Fyi, we did not intend to store strings that way, we have changed our code to not do that. As far as I am concerned, I am totally fine with arrow refusing such constructs for writing. But for reading, given that it was possible before, having the flexibility would be nice.

@rok
Copy link
Member

rok commented Aug 13, 2024

Thanks for the context @bschindler and for letting us know about this!

@jorisvandenbossche do you have a preference on tackling this? If not I'd lean towards 3 for now and can open a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants