feat: use pyarrow for string functions #2616

Merged
merged 75 commits into from
Aug 8, 2023
Changes from 74 commits
f499f27
First function is working: is_alnum.
jpivarski Aug 4, 2023
018b8e3
is_alpha
jpivarski Aug 4, 2023
1d97c32
is_decimal
jpivarski Aug 4, 2023
f3d2075
is_lower
jpivarski Aug 4, 2023
784dc68
is_digit
jpivarski Aug 4, 2023
73b346d
is_numeric
jpivarski Aug 4, 2023
eff2dfe
is_printable
jpivarski Aug 4, 2023
82b5a7b
is_space
jpivarski Aug 4, 2023
c8c669c
is_upper
jpivarski Aug 4, 2023
b9f9868
is_title
jpivarski Aug 4, 2023
88709b2
is_ascii; done with string predicates
jpivarski Aug 4, 2023
7a5463a
capitalize
jpivarski Aug 4, 2023
56cb0b1
lower
jpivarski Aug 4, 2023
2c1fe11
upper
jpivarski Aug 4, 2023
d7db042
upper
jpivarski Aug 4, 2023
951f9b9
title
jpivarski Aug 4, 2023
adab599
T -> T operations on bytestrings should return bytestrings.
jpivarski Aug 4, 2023
8279fde
repeat (the first that needs a broadcastable argument)
jpivarski Aug 5, 2023
4c41240
reverse (because it's easy)
jpivarski Aug 5, 2023
42604f0
replace_slice
jpivarski Aug 5, 2023
b69d7a2
replace_substring
jpivarski Aug 5, 2023
3d825aa
Also test 'max_replacements' in replace_substring.
jpivarski Aug 5, 2023
983c3ba
replace_substring_regex: done with string transforms
jpivarski Aug 5, 2023
bb8e8d7
center
jpivarski Aug 5, 2023
fa5d0bc
lpad and rpad
jpivarski Aug 5, 2023
99c4ce0
trim
jpivarski Aug 5, 2023
d713670
trim_whitespace
jpivarski Aug 5, 2023
e63bd3e
ltrim
jpivarski Aug 5, 2023
3040c4e
rtrim
jpivarski Aug 5, 2023
6320f2e
rtrim_whitespace
jpivarski Aug 5, 2023
3d0998b
ltrim_whitespace
jpivarski Aug 5, 2023
e624ee3
slice
jpivarski Aug 5, 2023
766c9df
feat: add `split_whitespace`
agoose77 Aug 7, 2023
c25a558
test: add test for `split_whitespace`
agoose77 Aug 7, 2023
ddc9bc7
test: correct test
agoose77 Aug 7, 2023
5638a79
feat: add `split_pattern`
agoose77 Aug 7, 2023
3ef7ded
refactor: rename `_get_action`
agoose77 Aug 7, 2023
65d2166
feat: add `ak_split_pattern_regex`
agoose77 Aug 7, 2023
0e26798
test: update tests for new features
agoose77 Aug 7, 2023
5ec706c
Fixed UnmaskedArray._drop_none.
jpivarski Aug 7, 2023
bd8e2e6
fix: adjust for numexpr 2.8.5, which hid getContext's frame_depth arg…
jpivarski Aug 7, 2023
73c8121
extract_regex.
jpivarski Aug 7, 2023
dc0746c
join (almost entirely from https://gist.github.com/agoose77/28e5bb025…
jpivarski Aug 7, 2023
43aa272
use dispatch correctly
jpivarski Aug 7, 2023
cbf1577
fix: drop unused arg
agoose77 Aug 7, 2023
068b6af
join_element_wise
jpivarski Aug 7, 2023
ffeef7b
Revert "use dispatch correctly"
agoose77 Aug 7, 2023
19c7197
fix: broadcast `num_repeats`
agoose77 Aug 7, 2023
21973bd
feat: add `count_substring[_pattern]`
agoose77 Aug 7, 2023
d385e61
docs: fixup docstring
agoose77 Aug 7, 2023
c9164d5
feat: add `ends_with`
agoose77 Aug 7, 2023
aac5e8a
feat: add `starts_with`
agoose77 Aug 7, 2023
17a6a0e
docs: fix link
agoose77 Aug 7, 2023
83f1597
feat: add `find_substring`
agoose77 Aug 7, 2023
6ad578f
docs: fix typo
agoose77 Aug 7, 2023
3141ebb
feat: add `find_substring_regex`
agoose77 Aug 7, 2023
4c69e86
docs: fix link
agoose77 Aug 7, 2023
8e230f4
feat: add `match_like`
agoose77 Aug 7, 2023
c676fbd
test: improve test
agoose77 Aug 7, 2023
99584ba
feat: add `match_substring`, `match_substring_regex`
agoose77 Aug 7, 2023
c456b44
feat: add `is_in` and `index_in`
agoose77 Aug 7, 2023
88f45cc
fix: operate at leaf depth
agoose77 Aug 7, 2023
6745ba2
refactor: add internal `pyarrow.compute` helper
agoose77 Aug 8, 2023
4422ad8
refactor: use pyarrow import helper
agoose77 Aug 8, 2023
ec6cefa
refactor: add `module` and `name` arguments to `high_level_function`
agoose77 Aug 8, 2023
307a3ea
fix: pass `module` to str `high_level_function`
agoose77 Aug 8, 2023
51a5c5c
docs: homogenize docstrings
agoose77 Aug 8, 2023
447cde7
docs: add see also
agoose77 Aug 8, 2023
cbba554
docs: include `ak.str` in toctree
agoose77 Aug 8, 2023
6e39bf1
chore: update pre-commit hooks (#2619)
pre-commit-ci[bot] Aug 8, 2023
9fee3fc
refactor: cleanup error handling
agoose77 Aug 8, 2023
a2ca690
Merge branch 'main' into jpivarski/use-pyarrow-for-strings
jpivarski Aug 8, 2023
c5f5cb7
Rename ak_*.py modules -> akstr_*.py.
jpivarski Aug 8, 2023
7bcb12c
docs: be explicit about `ak_str_`
agoose77 Aug 8, 2023
34d0184
Merge branch 'main' into jpivarski/use-pyarrow-for-strings
agoose77 Aug 8, 2023
1 change: 1 addition & 0 deletions docs/prepare_docstrings.py
@@ -303,6 +303,7 @@ def dofunction(link, linelink, shortname, name, astfcn):
         .replace(".behaviors.string", "")
     )
     shortname = re.sub(r"\.operations\.ak_\w+", "", shortname)
+    shortname = re.sub(r"\.operations\.str\.akstr_\w+", ".str", shortname)
     shortname = re.sub(r"\.(contents|types|forms)\.\w+", r".\1", shortname)

     if (
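For illustration, the three substitutions in this hunk can be exercised on their own. This is a minimal sketch, not code from `docs/prepare_docstrings.py`: the helper name `shorten` and the sample dotted names are assumptions made for the example, and it assumes the package prefix has already been abbreviated to `ak`.

```python
import re


def shorten(name: str) -> str:
    """Apply the three substitutions from the hunk to a dotted name (hypothetical helper)."""
    # Drop the per-function module: ak.operations.ak_all.all -> ak.all
    name = re.sub(r"\.operations\.ak_\w+", "", name)
    # New in this PR: collapse the string submodules into the ak.str namespace.
    name = re.sub(r"\.operations\.str\.akstr_\w+", ".str", name)
    # Drop the per-class module but keep the subpackage name.
    name = re.sub(r"\.(contents|types|forms)\.\w+", r".\1", name)
    return name


# Sample inputs are made up for the example.
print(shorten("ak.operations.str.akstr_is_alnum.is_alnum"))  # ak.str.is_alnum
print(shorten("ak.operations.ak_all.all"))  # ak.all
print(shorten("ak.contents.listarray.ListArray"))  # ak.contents.ListArray
```

The first pattern does not touch the string modules because they sit under `.operations.str.`, not directly under `.operations.`, which is why the second substitution is needed.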
73 changes: 73 additions & 0 deletions docs/reference/toctree.txt
@@ -145,6 +145,79 @@
     generated/ak.argcartesian
     generated/ak.argcombinations

+.. toctree::
+    :caption: String predicates
+
+    generated/ak.str.is_alnum
+    generated/ak.str.is_alpha
+    generated/ak.str.is_ascii
+    generated/ak.str.is_decimal
+    generated/ak.str.is_digit
+    generated/ak.str.is_lower
+    generated/ak.str.is_numeric
+    generated/ak.str.is_printable
+    generated/ak.str.is_space
+    generated/ak.str.is_title
+    generated/ak.str.is_upper
+
+.. toctree::
+    :caption: String transforms
+
+    generated/ak.str.capitalize
+    generated/ak.str.length
+    generated/ak.str.lower
+    generated/ak.str.repeat
+    generated/ak.str.replace_slice
+    generated/ak.str.replace_substring
+    generated/ak.str.replace_substring_regex
+    generated/ak.str.reverse
+    generated/ak.str.swapcase
+    generated/ak.str.title
+    generated/ak.str.upper
+
+.. toctree::
+    :caption: String padding and trimming
+
+    generated/ak.str.center
+    generated/ak.str.lpad
+    generated/ak.str.rpad
+    generated/ak.str.ltrim
+    generated/ak.str.ltrim_whitespace
+    generated/ak.str.rtrim
+    generated/ak.str.rtrim_whitespace
+    generated/ak.str.trim
+    generated/ak.str.trim_whitespace
+
+.. toctree::
+    :caption: String splitting and joining
+
+    generated/ak.str.split_pattern
+    generated/ak.str.split_pattern_regex
+    generated/ak.str.split_whitespace
+    generated/ak.str.join
+    generated/ak.str.join_element_wise
+
+.. toctree::
+    :caption: String slicing and decomposition
+
+    generated/ak.str.slice
+    generated/ak.str.extract_regex
+
+.. toctree::
+    :caption: String containment tests
+
+    generated/ak.str.count_substring
+    generated/ak.str.count_substring_regex
+    generated/ak.str.ends_with
+    generated/ak.str.find_substring
+    generated/ak.str.find_substring_regex
+    generated/ak.str.index_in
+    generated/ak.str.is_in
+    generated/ak.str.match_like
+    generated/ak.str.match_substring
+    generated/ak.str.match_substring_regex
+    generated/ak.str.starts_with
+
 .. toctree::
     :caption: Value and type conversions

3 changes: 2 additions & 1 deletion pyproject.toml
@@ -309,7 +309,8 @@ mccabe.max-complexity = 100
 "src/awkward/_connect/*" = ["TID251"]
 "src/awkward/__init__.py" = ["E402", "F401", "F403", "I001"]
 "src/awkward/_ext.py" = ["F401"]
-"src/awkward/operations/__init__.py" = ["F403"]
+"src/awkward/operations/__init__.py" = ["F401", "F403"]
+"src/awkward/operations/str/__init__.py" = ["F401", "F403", "I001"]
 "src/awkward/_nplikes/*" = ["TID251"]
 "src/awkward/_operators.py" = ["TID251"]
 "tests*/*" = ["T20", "TID251"]
17 changes: 14 additions & 3 deletions src/awkward/_connect/pyarrow.py
@@ -1,7 +1,9 @@
# BSD 3-Clause License; see https://github.com/scikit-hep/awkward-1.0/blob/main/LICENSE
from __future__ import annotations

import json
from collections.abc import Iterable, Sized
from types import ModuleType

from packaging.version import parse as parse_version

@@ -36,13 +38,13 @@
 error_message = "pyarrow 7.0.0 or later required for {0}"


-def import_pyarrow(name):
+def import_pyarrow(name: str) -> ModuleType:
     if pyarrow is None:
         raise ImportError(error_message.format(name))
     return pyarrow


-def import_pyarrow_parquet(name):
+def import_pyarrow_parquet(name: str) -> ModuleType:
     if pyarrow is None:
         raise ImportError(error_message.format(name))

@@ -51,7 +53,16 @@ def import_pyarrow_parquet(name):
     return out


-def import_fsspec(name):
+def import_pyarrow_compute(name: str) -> ModuleType:
+    if pyarrow is None:
+        raise ImportError(error_message.format(name))
+
+    import pyarrow.compute as out
+
+    return out
+
+
+def import_fsspec(name: str) -> ModuleType:
     try:
         import fsspec
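The helpers in this file share one pattern: return the optional module, or raise an `ImportError` that names the feature needing it. A generic sketch of that pattern follows. It is not the code in this PR: `import_optional` and `ERROR_MESSAGE` are hypothetical names, and the sketch imports lazily through `importlib` instead of checking a module-level `pyarrow is None` flag as the PR does.

```python
import importlib
from types import ModuleType

# Hypothetical message template; the PR uses its own `error_message` string.
ERROR_MESSAGE = "{module} is required for {name}"


def import_optional(module: str, name: str) -> ModuleType:
    """Return the named module, or raise ImportError mentioning the feature `name`."""
    try:
        return importlib.import_module(module)
    except ImportError as err:
        # Re-raise with a message that says which feature needed the dependency.
        raise ImportError(ERROR_MESSAGE.format(module=module, name=name)) from err


# A standard-library module stands in for the optional dependency here.
json_module = import_optional("json", "this example")
print(json_module.dumps({"ok": True}))
```

Passing the caller's name through to the error message is the part worth keeping: a user who hits the error learns which operation pulled in the dependency.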

2 changes: 1 addition & 1 deletion src/awkward/contents/unmaskedarray.py
@@ -491,7 +491,7 @@ def _remove_structure(self, backend, options):
         return [self]

     def _drop_none(self) -> Content:
-        return self.to_ByteMaskedArray(True)._drop_none()
+        return self.content

     def _recursively_apply(
         self, action, behavior, depth, depth_context, lateral_context, options
2 changes: 1 addition & 1 deletion src/awkward/operations/__init__.py
@@ -1,6 +1,6 @@
# BSD 3-Clause License; see https://github.com/scikit-hep/awkward-1.0/blob/main/LICENSE
# ruff: noqa: F401

import awkward.operations.str
from awkward.operations.ak_all import *
from awkward.operations.ak_almost_equal import *
from awkward.operations.ak_any import *
205 changes: 205 additions & 0 deletions src/awkward/operations/str/__init__.py
@@ -0,0 +1,205 @@
# BSD 3-Clause License; see https://github.com/scikit-hep/awkward-1.0/blob/main/LICENSE

# https://arrow.apache.org/docs/python/api/compute.html#string-predicates

# string predicates
from awkward.operations.str.akstr_is_alnum import *
from awkward.operations.str.akstr_is_alpha import *
from awkward.operations.str.akstr_is_decimal import *
from awkward.operations.str.akstr_is_digit import *
from awkward.operations.str.akstr_is_lower import *
from awkward.operations.str.akstr_is_numeric import *
from awkward.operations.str.akstr_is_printable import *
from awkward.operations.str.akstr_is_space import *
from awkward.operations.str.akstr_is_upper import *
from awkward.operations.str.akstr_is_title import *
from awkward.operations.str.akstr_is_ascii import *

# string transforms
from awkward.operations.str.akstr_capitalize import *
from awkward.operations.str.akstr_length import *
from awkward.operations.str.akstr_lower import *
from awkward.operations.str.akstr_swapcase import *
from awkward.operations.str.akstr_title import *
from awkward.operations.str.akstr_upper import *
from awkward.operations.str.akstr_repeat import *
from awkward.operations.str.akstr_replace_slice import *
from awkward.operations.str.akstr_reverse import *
from awkward.operations.str.akstr_replace_substring import *
from awkward.operations.str.akstr_replace_substring_regex import *

# string padding
from awkward.operations.str.akstr_center import *
from awkward.operations.str.akstr_lpad import *
from awkward.operations.str.akstr_rpad import *

# string trimming
from awkward.operations.str.akstr_ltrim import *
from awkward.operations.str.akstr_ltrim_whitespace import *
from awkward.operations.str.akstr_rtrim import *
from awkward.operations.str.akstr_rtrim_whitespace import *
from awkward.operations.str.akstr_trim import *
from awkward.operations.str.akstr_trim_whitespace import *

# string splitting
from awkward.operations.str.akstr_split_whitespace import *
from awkward.operations.str.akstr_split_pattern import *
from awkward.operations.str.akstr_split_pattern_regex import *

# string component extraction

from awkward.operations.str.akstr_extract_regex import *

# string joining

from awkward.operations.str.akstr_join import *
from awkward.operations.str.akstr_join_element_wise import *

# string slicing

from awkward.operations.str.akstr_slice import *

# containment tests

from awkward.operations.str.akstr_count_substring import *
from awkward.operations.str.akstr_count_substring_regex import *
from awkward.operations.str.akstr_ends_with import *
from awkward.operations.str.akstr_find_substring import *
from awkward.operations.str.akstr_find_substring_regex import *
from awkward.operations.str.akstr_index_in import *
from awkward.operations.str.akstr_is_in import *
from awkward.operations.str.akstr_match_like import *
from awkward.operations.str.akstr_match_substring import *
from awkward.operations.str.akstr_match_substring_regex import *
from awkward.operations.str.akstr_starts_with import *


def _get_ufunc_action(
    utf8_function,
    ascii_function,
    *args,
    bytestring_to_string=False,
    **kwargs,
):
    from awkward.operations.ak_from_arrow import from_arrow
    from awkward.operations.ak_to_arrow import to_arrow

    def action(layout, **absorb):
        if layout.is_list and layout.parameter("__array__") == "string":
            return from_arrow(
                utf8_function(to_arrow(layout, extensionarray=False), *args, **kwargs),
                highlevel=False,
            )

        elif layout.is_list and layout.parameter("__array__") == "bytestring":
            if bytestring_to_string:
                out = from_arrow(
                    ascii_function(
                        to_arrow(
                            layout.copy(
                                content=layout.content.copy(
                                    parameters={"__array__": "char"}
                                ),
                                parameters={"__array__": "string"},
                            ),
                            extensionarray=False,
                        ),
                        *args,
                        **kwargs,
                    ),
                    highlevel=False,
                )
                if out.is_list and out.parameter("__array__") == "string":
                    out = out.copy(
                        content=out.content.copy(parameters={"__array__": "byte"}),
                        parameters={"__array__": "bytestring"},
                    )
                return out

            else:
                return from_arrow(
                    ascii_function(
                        to_arrow(layout, extensionarray=False), *args, **kwargs
                    ),
                    highlevel=False,
                )

    return action
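
The closure above picks a kernel from the layout's `__array__` parameter and, when `bytestring_to_string` is set, relabels a bytestring as a string before the call and back again afterwards. That dispatch can be sketched without Arrow or Awkward, under loud assumptions: `ToyStrings`, `_wrap`, and `get_toy_action` are hypothetical names, the kernels are plain Python callables applied per element, and ASCII decoding/encoding stands in for swapping the `char`/`byte` parameters.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass(frozen=True)
class ToyStrings:
    """Hypothetical stand-in for a list-type layout with an ``__array__`` parameter."""

    kind: str  # "string", "bytestring", or "other"
    values: Tuple[Any, ...]


def _wrap(values) -> ToyStrings:
    # Infer the label from the results, roughly as the real code inspects `out`.
    values = tuple(values)
    if all(isinstance(v, str) for v in values):
        return ToyStrings("string", values)
    if all(isinstance(v, bytes) for v in values):
        return ToyStrings("bytestring", values)
    return ToyStrings("other", values)


def get_toy_action(
    utf8_function: Callable[..., Any],
    ascii_function: Callable[..., Any],
    *args: Any,
    bytestring_to_string: bool = False,
    **kwargs: Any,
) -> Callable[..., Optional[ToyStrings]]:
    def action(layout: ToyStrings, **absorb: Any) -> Optional[ToyStrings]:
        if layout.kind == "string":
            return _wrap(utf8_function(v, *args, **kwargs) for v in layout.values)

        elif layout.kind == "bytestring":
            if bytestring_to_string:
                # Present the bytes as text, run the text kernel, then turn
                # text results back into bytes; other results pass through.
                out = [
                    ascii_function(v.decode("ascii"), *args, **kwargs)
                    for v in layout.values
                ]
                return _wrap(
                    v.encode("ascii") if isinstance(v, str) else v for v in out
                )
            else:
                return _wrap(ascii_function(v, *args, **kwargs) for v in layout.values)

        # Anything else falls through and returns None, as in the original.
        return None

    return action


upper = get_toy_action(str.upper, bytes.upper)
print(upper(ToyStrings("string", ("abc",))))

title = get_toy_action(str.title, str.title, bytestring_to_string=True)
print(title(ToyStrings("bytestring", (b"hello world",))))
```

The round trip is what keeps a T -> T operation on bytestrings returning bytestrings, while a predicate's boolean result is left untouched.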


def _erase_list_option(layout):
    from awkward.contents.unmaskedarray import UnmaskedArray

    assert layout.is_list
    if layout.content.is_option:
        assert isinstance(layout.content, UnmaskedArray)
        return layout.copy(content=layout.content.content)
Comment on lines +134 to +136
Member Author

Since we only ever send non-missing strings to Arrow, it's fair to assume that the only outputs are non-missing as well, even if Arrow says that the type is potentially nullable.

Actually, I think that the non-nullable type is part of the information that's lost by setting extensionarray=False. But we want that because Arrow Compute only applies its string operations if it recognizes the type, and it doesn't recognize the type if it's an extension array.

Member Author

It's not clear to me whether we're passing a nullable type, or whether Arrow is just deciding to return a nullable type.

I'm pretty sure that Arrow is just deciding to return a nullable type. Arrow's type system does not allow it to express non-nullability (the default is nullable) in the type objects themselves; non-nullability can only be expressed in fields. There's a two-level structure under each struct and Table: field (with name, nullability, and some other things), which contains type. If you have a raw list, not in a struct or Table, I don't think there's a place to put the non-nullability information.

These corner-cases are the reason we use extension arrays, to carry more information through Arrow and Parquet. But we can't use one here if we want Arrow Compute to recognize the input as strings.

    else:
        return layout


def _get_split_action(
    utf8_function, ascii_function, *args, bytestring_to_string=False, **kwargs
):
    from awkward.operations.ak_from_arrow import from_arrow
    from awkward.operations.ak_to_arrow import to_arrow

    def action(layout, **absorb):
        if layout.is_list and layout.parameter("__array__") == "string":
            return _erase_list_option(
                from_arrow(
                    utf8_function(
                        to_arrow(layout, extensionarray=False),
                        *args,
                        **kwargs,
                    ),
                    highlevel=False,
                )
            )

        elif layout.is_list and layout.parameter("__array__") == "bytestring":
            if bytestring_to_string:
                out = _erase_list_option(
                    from_arrow(
                        ascii_function(
                            to_arrow(
                                layout.copy(
                                    content=layout.content.copy(
                                        parameters={"__array__": "char"}
                                    ),
                                    parameters={"__array__": "string"},
                                ),
                                extensionarray=False,
                            ),
                            *args,
                            **kwargs,
                        ),
                        highlevel=False,
                    )
                )
                assert out.is_list

                assert (
                    out.content.is_list
                    and out.content.parameter("__array__") == "string"
                )
                return out.copy(
                    content=out.content.copy(
                        content=out.content.content.copy(
                            parameters={"__array__": "byte"}
                        ),
                        parameters={"__array__": "bytestring"},
                    ),
                )

            else:
                return _erase_list_option(
                    from_arrow(
                        ascii_function(
                            to_arrow(layout, extensionarray=False), *args, **kwargs
                        ),
                        highlevel=False,
                    )
                )

    return action