Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] StringViewArray.from_buffers does not seem to work as expected #44651

Closed
raulcd opened this issue Nov 5, 2024 · 6 comments
Closed
Assignees
Labels
Milestone

Comments

@raulcd
Copy link
Member

raulcd commented Nov 5, 2024

Describe the usage question you have. Please include as many useful details as possible.

I am trying to create a pyarrow string view array using from_buffers but it does not seem to be correctly supported, or I might not know how to use it:

I tried this basic snippet:

>>> import pyarrow as pa
>>> array1 = pa.array(['String longer than 12 characters', 'Short', None], type=pa.string_view())
>>> pa.StringViewArray.from_buffers(pa.string_view(), len(array1), array1.buffers())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 1178, in pyarrow.lib.Array.from_buffers
ValueError: Type's expected number of buffers (2) did not match the passed number (3).

If I try removing the null bitmap buffers to only pass the views buffer + the data buffer it complains on the buffer size:

>>> pa.StringViewArray.from_buffers(pa.string_view(), len(array1), [array1.buffers()[1], array1.buffers()[2]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 1193, in pyarrow.lib.Array.from_buffers
  File "pyarrow/array.pxi", line 1686, in pyarrow.lib.Array.validate
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Buffer #1 too small in array of type string_view and length 3: expected at least 48 byte(s), got 32

Of course, if the buffer is wrong (just change the order to avoid the complaints about size) it fails but we probably shouldn't get a segmentation fault:

>>> pa.StringViewArray.from_buffers(pa.string_view(), len(array1), [array1.buffers()[2], array1.buffers()[1]])

Segmentation fault (core dumped)

I am unsure if this is just me not knowing how to use from_buffers or if there is an issue with string view.

Component(s)

Python

@raulcd raulcd added the Type: usage Issue is a user question label Nov 5, 2024
@raulcd
Copy link
Member Author

raulcd commented Nov 5, 2024

ok, the problem seems to be if there is any null because this works:

>>> array2 = pa.array(['String longer than 12 characters', 'Short'], type=pa.string_view())
>>> new_array = pa.StringViewArray.from_buffers(pa.string_view(), len(array2), [array2.buffers()[1], array2.buffers()[2]])
>>> new_array
<pyarrow.lib.StringViewArray object at 0x73fc88321f60>
[
  null,
  null
]
>>> new_array.buffers()
[<pyarrow.Buffer address=0x3bac00303c0 size=32 is_cpu=True is_mutable=True>, <pyarrow.Buffer address=0x3bac0030400 size=32 is_cpu=True is_mutable=True>]

even the original works if I cheat and tell from_buffers that the size is one less than expected (avoid the null at the end):

>>> array1 = pa.array(['String longer than 12 characters', 'Short', None], type=pa.string_view())
>>> pa.StringViewArray.from_buffers(pa.string_view(), len(array1)-1, [array1.buffers()[1], array1.buffers()[2]])
<pyarrow.lib.StringViewArray object at 0x73fc88321f60>
[
  null,
  null
]

This seems like a bug :)

@raulcd
Copy link
Member Author

raulcd commented Nov 5, 2024

cc @jorisvandenbossche @AlenkaF

@jorisvandenbossche
Copy link
Member

I remember that I had this on my (mental) to do list when initially implementing support for StringView in Python (#39652), but it seems I never got around opening an issue or PR for it.

I think we have to either override from_buffers in the subclass (as we do for some other types), or update the hardcoded check if type.num_buffers != len(buffers) to address the fact that for string/binary_view types we have a variadic number of buffers, so that we can't do this check for those types.

BTW, if you just remove that check for a moment, does your initial example then work?

@jorisvandenbossche
Copy link
Member

but it seems I never got around opening an issue or PR for it.

Ah, I added a bullet point mentioning it in the TODO list of #39633 ...

@raulcd
Copy link
Member Author

raulcd commented Nov 12, 2024

Yes, if I remove the check:

diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi
index eaedbf1e3..4ddbbea32 100644
--- a/python/pyarrow/array.pxi
+++ b/python/pyarrow/array.pxi
@@ -1174,10 +1174,10 @@ cdef class Array(_PandasConvertible):
                              "({0}) did not match the passed number "
                              "({1}).".format(type.num_fields, len(children)))
 
-        if type.num_buffers != len(buffers):
-            raise ValueError("Type's expected number of buffers "
-                             "({0}) did not match the passed number "
-                             "({1}).".format(type.num_buffers, len(buffers)))
+        #if type.num_buffers != len(buffers):
+        #    raise ValueError("Type's expected number of buffers "
+        #                     "({0}) did not match the passed number "
+        #                     "({1}).".format(type.num_buffers, len(buffers)))
 
         for buf in buffers:
             # None will produce a null buffer pointer

then it works as expected:

>>> import pyarrow as pa
>>> array1 = pa.array(['String longer than 12 characters', 'Short', None], type=pa.string_view())
>>> new_array = pa.StringViewArray.from_buffers(pa.string_view(), len(array1), array1.buffers())
>>> new_array
<pyarrow.lib.StringViewArray object at 0x739d58dc92a0>
[
  "String longer than 12 characters",
  "Short",
  null
]

@raulcd raulcd self-assigned this Nov 12, 2024
raulcd added a commit to raulcd/arrow that referenced this issue Nov 12, 2024
raulcd added a commit that referenced this issue Nov 18, 2024
…on (#44701)

### Rationale for this change

Currently `from_buffers` is not working with StringView on Python because we validate against num_buffers. This only take into account the mandatory buffers but does not take into account the variadic_spec that can be present for both string_view and binary_view

### What changes are included in this PR?

Take into account whether the type contains a variadic_spec for the non-mandatory buffers and only check lower_bound number of buffers.

### Are these changes tested?

Yes, I've added a couple of tests.

### Are there any user-facing changes?

We are exposing a new method on the Python DataType. `has_variadic_buffers` which tells us whether the number of buffers expected is only lower-bounded by num_buffers.
* GitHub Issue: #44651

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
@raulcd raulcd added this to the 19.0.0 milestone Nov 18, 2024
@raulcd
Copy link
Member Author

raulcd commented Nov 18, 2024

Issue resolved by pull request 44701
#44701

@raulcd raulcd closed this as completed Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants