feat: use pyarrow for string functions #2616
Merged
Changes from 74 commits
f499f27 First function is working: is_alnum. (jpivarski)
018b8e3 is_alpha (jpivarski)
1d97c32 is_decimal (jpivarski)
f3d2075 is_lower (jpivarski)
784dc68 is_digit (jpivarski)
73b346d is_numeric (jpivarski)
eff2dfe is_printable (jpivarski)
82b5a7b is_space (jpivarski)
c8c669c is_upper (jpivarski)
b9f9868 is_title (jpivarski)
88709b2 is_ascii; done with string predicates (jpivarski)
7a5463a capitalize (jpivarski)
56cb0b1 lower (jpivarski)
2c1fe11 upper (jpivarski)
d7db042 upper (jpivarski)
951f9b9 title (jpivarski)
adab599 T -> T operations on bytestrings should return bytestrings. (jpivarski)
8279fde repeat (the first that needs a broadcastable argument) (jpivarski)
4c41240 reverse (because it's easy) (jpivarski)
42604f0 replace_slice (jpivarski)
b69d7a2 replace_substring (jpivarski)
3d825aa Also test 'max_replacements' in replace_substring. (jpivarski)
983c3ba replace_substring_regex: done with string transforms (jpivarski)
bb8e8d7 center (jpivarski)
fa5d0bc lpad and rpad (jpivarski)
99c4ce0 trim (jpivarski)
d713670 trim_whitespace (jpivarski)
e63bd3e ltrim (jpivarski)
3040c4e rtrim (jpivarski)
6320f2e rtrim_whitespace (jpivarski)
3d0998b ltrim_whitespace (jpivarski)
e624ee3 slice (jpivarski)
766c9df feat: add `split_whitespace` (agoose77)
c25a558 test: add test for `split_whitespace` (agoose77)
ddc9bc7 test: correct test (agoose77)
5638a79 feat: add `split_pattern` (agoose77)
3ef7ded refactor: rename `_get_action` (agoose77)
65d2166 feat: add `ak_split_pattern_regex` (agoose77)
0e26798 test: update tests for new features (agoose77)
5ec706c Fixed UnmaskedArray._drop_none. (jpivarski)
bd8e2e6 fix: adjust for numexpr 2.8.5, which hid getContext's frame_depth arg… (jpivarski)
73c8121 extract_regex. (jpivarski)
dc0746c join (almost entirely from https://gist.github.com/agoose77/28e5bb025… (jpivarski)
43aa272 use dispatch correctly (jpivarski)
cbf1577 fix: drop unused arg (agoose77)
068b6af join_element_wise (jpivarski)
ffeef7b Revert "use dispatch correctly" (agoose77)
19c7197 fix: broadcast `num_repeats` (agoose77)
21973bd feat: add `count_substring[_pattern]` (agoose77)
d385e61 docs: fixup docstring (agoose77)
c9164d5 feat: add `ends_with` (agoose77)
aac5e8a feat: add `starts_with` (agoose77)
17a6a0e docs: fix link (agoose77)
83f1597 feat: add `find_substring` (agoose77)
6ad578f docs: fix typo (agoose77)
3141ebb feat: add `find_substring_regex` (agoose77)
4c69e86 docs: fix link (agoose77)
8e230f4 feat: add `match_like` (agoose77)
c676fbd test: improve test (agoose77)
99584ba feat: add `match_substring`, `match_substring_regex` (agoose77)
c456b44 feat: add `is_in` and `index_in` (agoose77)
88f45cc fix: operate at leaf depth (agoose77)
6745ba2 refactor: add internal `pyarrow.compute` helper (agoose77)
4422ad8 refactor: use pyarrow import helper (agoose77)
ec6cefa refactor: add `module` and `name` arguments to `high_level_function` (agoose77)
307a3ea fix: pass `module` to str `high_level_function` (agoose77)
51a5c5c docs: homogenize docstrings (agoose77)
447cde7 docs: add see also (agoose77)
cbba554 docs: include `ak.str` in toctree (agoose77)
6e39bf1 chore: update pre-commit hooks (#2619) (pre-commit-ci[bot])
9fee3fc refactor: cleanup error handling (agoose77)
a2ca690 Merge branch 'main' into jpivarski/use-pyarrow-for-strings (jpivarski)
c5f5cb7 Rename ak_*.py modules -> akstr_*.py. (jpivarski)
7bcb12c docs: be explicit about `ak_str_` (agoose77)
34d0184 Merge branch 'main' into jpivarski/use-pyarrow-for-strings (agoose77)
@@ -0,0 +1,205 @@
# BSD 3-Clause License; see https://github.com/scikit-hep/awkward-1.0/blob/main/LICENSE

# https://arrow.apache.org/docs/python/api/compute.html#string-predicates

# string predicates
from awkward.operations.str.akstr_is_alnum import *
from awkward.operations.str.akstr_is_alpha import *
from awkward.operations.str.akstr_is_decimal import *
from awkward.operations.str.akstr_is_digit import *
from awkward.operations.str.akstr_is_lower import *
from awkward.operations.str.akstr_is_numeric import *
from awkward.operations.str.akstr_is_printable import *
from awkward.operations.str.akstr_is_space import *
from awkward.operations.str.akstr_is_upper import *
from awkward.operations.str.akstr_is_title import *
from awkward.operations.str.akstr_is_ascii import *

# string transforms
from awkward.operations.str.akstr_capitalize import *
from awkward.operations.str.akstr_length import *
from awkward.operations.str.akstr_lower import *
from awkward.operations.str.akstr_swapcase import *
from awkward.operations.str.akstr_title import *
from awkward.operations.str.akstr_upper import *
from awkward.operations.str.akstr_repeat import *
from awkward.operations.str.akstr_replace_slice import *
from awkward.operations.str.akstr_reverse import *
from awkward.operations.str.akstr_replace_substring import *
from awkward.operations.str.akstr_replace_substring_regex import *

# string padding
from awkward.operations.str.akstr_center import *
from awkward.operations.str.akstr_lpad import *
from awkward.operations.str.akstr_rpad import *

# string trimming
from awkward.operations.str.akstr_ltrim import *
from awkward.operations.str.akstr_ltrim_whitespace import *
from awkward.operations.str.akstr_rtrim import *
from awkward.operations.str.akstr_rtrim_whitespace import *
from awkward.operations.str.akstr_trim import *
from awkward.operations.str.akstr_trim_whitespace import *

# string splitting
from awkward.operations.str.akstr_split_whitespace import *
from awkward.operations.str.akstr_split_pattern import *
from awkward.operations.str.akstr_split_pattern_regex import *

# string component extraction

from awkward.operations.str.akstr_extract_regex import *

# string joining

from awkward.operations.str.akstr_join import *
from awkward.operations.str.akstr_join_element_wise import *

# string slicing

from awkward.operations.str.akstr_slice import *

# containment tests

from awkward.operations.str.akstr_count_substring import *
from awkward.operations.str.akstr_count_substring_regex import *
from awkward.operations.str.akstr_ends_with import *
from awkward.operations.str.akstr_find_substring import *
from awkward.operations.str.akstr_find_substring_regex import *
from awkward.operations.str.akstr_index_in import *
from awkward.operations.str.akstr_is_in import *
from awkward.operations.str.akstr_match_like import *
from awkward.operations.str.akstr_match_substring import *
from awkward.operations.str.akstr_match_substring_regex import *
from awkward.operations.str.akstr_starts_with import *


def _get_ufunc_action(
    utf8_function,
    ascii_function,
    *args,
    bytestring_to_string=False,
    **kwargs,
):
    from awkward.operations.ak_from_arrow import from_arrow
    from awkward.operations.ak_to_arrow import to_arrow

    def action(layout, **absorb):
        if layout.is_list and layout.parameter("__array__") == "string":
            return from_arrow(
                utf8_function(to_arrow(layout, extensionarray=False), *args, **kwargs),
                highlevel=False,
            )

        elif layout.is_list and layout.parameter("__array__") == "bytestring":
            if bytestring_to_string:
                out = from_arrow(
                    ascii_function(
                        to_arrow(
                            layout.copy(
                                content=layout.content.copy(
                                    parameters={"__array__": "char"}
                                ),
                                parameters={"__array__": "string"},
                            ),
                            extensionarray=False,
                        ),
                        *args,
                        **kwargs,
                    ),
                    highlevel=False,
                )
                if out.is_list and out.parameter("__array__") == "string":
                    out = out.copy(
                        content=out.content.copy(parameters={"__array__": "byte"}),
                        parameters={"__array__": "bytestring"},
                    )
                return out

            else:
                return from_arrow(
                    ascii_function(
                        to_arrow(layout, extensionarray=False), *args, **kwargs
                    ),
                    highlevel=False,
                )

    return action


def _erase_list_option(layout):
    from awkward.contents.unmaskedarray import UnmaskedArray

    assert layout.is_list
    if layout.content.is_option:
        assert isinstance(layout.content, UnmaskedArray)
        return layout.copy(content=layout.content.content)
    else:
        return layout


def _get_split_action(
    utf8_function, ascii_function, *args, bytestring_to_string=False, **kwargs
):
    from awkward.operations.ak_from_arrow import from_arrow
    from awkward.operations.ak_to_arrow import to_arrow

    def action(layout, **absorb):
        if layout.is_list and layout.parameter("__array__") == "string":
            return _erase_list_option(
                from_arrow(
                    utf8_function(
                        to_arrow(layout, extensionarray=False),
                        *args,
                        **kwargs,
                    ),
                    highlevel=False,
                )
            )

        elif layout.is_list and layout.parameter("__array__") == "bytestring":
            if bytestring_to_string:
                out = _erase_list_option(
                    from_arrow(
                        ascii_function(
                            to_arrow(
                                layout.copy(
                                    content=layout.content.copy(
                                        parameters={"__array__": "char"}
                                    ),
                                    parameters={"__array__": "string"},
                                ),
                                extensionarray=False,
                            ),
                            *args,
                            **kwargs,
                        ),
                        highlevel=False,
                    )
                )
                assert out.is_list

                assert (
                    out.content.is_list
                    and out.content.parameter("__array__") == "string"
                )
                return out.copy(
                    content=out.content.copy(
                        content=out.content.content.copy(
                            parameters={"__array__": "byte"}
                        ),
                        parameters={"__array__": "bytestring"},
                    ),
                )

            else:
                return _erase_list_option(
                    from_arrow(
                        ascii_function(
                            to_arrow(layout, extensionarray=False), *args, **kwargs
                        ),
                        highlevel=False,
                    )
                )

    return action
Since we only ever send non-missing strings to Arrow, it's fair to assume that the only outputs are non-missing as well, even if Arrow says that the type is potentially nullable.
Actually, I think that the non-nullable type is part of the information that's lost by setting `extensionarray=False`. But we want that because Arrow Compute only applies its string operations if it recognizes the type, and it doesn't recognize the type if it's an extension array.
I'm pretty sure that Arrow is just deciding to return a nullable type. Arrow's type system does not allow it to express non-nullability (the default is nullable) in the type objects themselves; non-nullability can only be expressed in fields. There's a two-level structure under each struct and Table: field (with name, nullability, and some other things), which contains type. If you have a raw list, not in a struct or Table, I don't think there's a place to put the non-nullability information.
These corner cases are the reason we use extension arrays: to carry more information through Arrow and Parquet. But we can't use one here if we want Arrow Compute to recognize the input as strings.