Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use pyarrow for string functions #2616

Merged
merged 75 commits into from
Aug 8, 2023

Conversation

jpivarski
Copy link
Member

@jpivarski jpivarski commented Aug 4, 2023

This PR adds support for vectorised string operations using PyArrow's pyarrow.compute module.
It implements all of the listed string predicates, string transforms, string padding, string trimming, string splitting, string component extraction, string joining, string slicing, and string containment tests.

String predicates

String transforms

  • capitalize Capitalize the first character of input.
  • length Compute UTF8 string lengths.
  • lower Transform input to lowercase.
  • swapcase Transform input lowercase characters to uppercase and uppercase characters to lowercase.
  • title Titlecase each word of input.
  • upper Transform input to uppercase.
  • repeat Repeat a binary string.
  • replace_slice Replace a slice of a string.
  • reverse Reverse input.
  • replace_substring Replace matching non-overlapping substrings with replacement.
  • replace_substring_regex Replace matching non-overlapping substrings with replacement.

String padding

  • center Center strings by padding with a given character.
  • lpad Right-align strings by padding with a given character.
  • rpad Left-align strings by padding with a given character.

String trimming

String splitting

String component extraction

String joining

  • join Join a list of strings together with a separator.
  • join_element_wise Join string arguments together, with the last argument as separator.

String slicing

Containment tests

@jpivarski jpivarski temporarily deployed to docs-preview August 4, 2023 21:45 — with GitHub Actions Inactive
@codecov
Copy link

codecov bot commented Aug 4, 2023

Codecov Report

❗ No coverage uploaded for pull request base (main@85ca6d5). Click here to learn what that means.
The diff coverage is 98.93%.

Additional details and impacted files
Files Changed Coverage Δ
src/awkward/_connect/pyarrow.py 91.15% <90.00%> (ø)
src/awkward/operations/str/akstr_join.py 92.68% <92.68%> (ø)
src/awkward/operations/str/akstr_repeat.py 94.44% <94.44%> (ø)
.../awkward/operations/str/akstr_join_element_wise.py 95.65% <95.65%> (ø)
src/awkward/operations/str/akstr_index_in.py 96.42% <96.42%> (ø)
src/awkward/operations/str/akstr_is_in.py 96.42% <96.42%> (ø)
src/awkward/operations/str/__init__.py 98.80% <98.80%> (ø)
src/awkward/contents/unmaskedarray.py 73.62% <100.00%> (ø)
src/awkward/operations/__init__.py 100.00% <100.00%> (ø)
src/awkward/operations/str/akstr_capitalize.py 100.00% <100.00%> (ø)
... and 43 more

@jpivarski jpivarski temporarily deployed to docs-preview August 4, 2023 21:59 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 4, 2023 22:14 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 4, 2023 22:36 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 4, 2023 22:43 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 4, 2023 22:54 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 4, 2023 23:08 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 4, 2023 23:20 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 4, 2023 23:35 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 5, 2023 00:21 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 5, 2023 00:34 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 5, 2023 00:55 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 5, 2023 01:13 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 5, 2023 01:32 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 5, 2023 01:52 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 5, 2023 02:18 — with GitHub Actions Inactive
@jpivarski
Copy link
Member Author

The split_* functions don't look fundamentally hard. They have to pass parameters (constants, not broadcasted arrays) and they output lists of strings instead of strings. If the bytestring_to_string=True option is needed for bytestrings (very unlikely!), then the string → bytestring repacking will need to be applied one level deeper (but I don't think that bytestring_to_string=True will be needed).

The extract_regex function is straightforward to call, but it returns an Arrow struct. If the record field names are formulaic ("\1", "\2", etc.), then we'd probably want to set the fields of the output to None so that it becomes an Awkward tuple. (We can assign to the output in place because it was just created by ak.from_arrow, and the fields is not the zero-copy data.)

The only remaining function that needs to be broadcasted (like binary_repeat) is join_element_wise. Unlike binary_repeat, there's no choice about whether to broadcast or not, and there can be more than two arrays to broadcast. (All of the *strings arrays documented here would have to be broadcasted). This will probably be the hardest one.

By comparison, join (documented here) does not broadcast, but it needs to be applied at the level of lists of strings, not just strings (one level deep). It's a reducer.

Two of the containment tests take an array-like second argument: index_in and is_in. This argument is value_set (see index_in and is_in, which is a set of strings that each string in the values is compared against, but it is not broadcasted to values. So it's much easier than join_element_wise.

None of the remaining functions look sufficiently different from the ones that have been implemented so far as to be a problem.

@agoose77
Copy link
Collaborator

agoose77 commented Aug 6, 2023

Jim, this is a phenomenal piece of work. Looking forward to reviewing it.

P.S. I touched your PR description to fix the links. Correct anything if you feel I made a mistake.

@agoose77 agoose77 temporarily deployed to docs-preview August 7, 2023 09:27 — with GitHub Actions Inactive
@agoose77 agoose77 force-pushed the jpivarski/use-pyarrow-for-strings branch from 461eb83 to 258dab9 Compare August 7, 2023 09:44
@agoose77 agoose77 temporarily deployed to docs-preview August 7, 2023 09:51 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview August 7, 2023 10:30 — with GitHub Actions Inactive
@agoose77
Copy link
Collaborator

agoose77 commented Aug 7, 2023

I've added split_whitespace, split_pattern, and split_pattern_regex. I've noticed that these operations return nullable types, so the round-trip logic in _get_action doesn't correctly re-interpret the result as bytestrings if the input is a bytestring (and bytestring_to_string=True).

As such, I added a new _get_split_action that handles the case where a returned list contains a nullable string. I don't *think* we should ever get anything besides an UnmaskedArray` here, as we never pass any null data to Arrow.

It's not clear to me whether we're passing a nullable type, or whether Arrow is just deciding to return a nullable type. My hunch was that if the input StringArray is nullable, then the result is nullable, but I can't check whether the type we build is considered nullable by arrow; large_string or large_binary don't seem to expose this information.

I haven't added tests here yet, but they should be fairly easy.

@agoose77 agoose77 temporarily deployed to docs-preview August 7, 2023 11:52 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview August 7, 2023 12:19 — with GitHub Actions Inactive
@agoose77
Copy link
Collaborator

agoose77 commented Aug 7, 2023

binary_join is not too tricky, but it looks like the Arrow kernel only supports nullable, small strings. A first-pass is here, although it should be cleaned up. I just played around to get it working, rather than thinking through robustly what the recursion function should look like.

When working on these functions, two questions come to mind:

  1. Should we perform validation logic in Awkward, rather than relying on Arrow to error if an invalid type is passed?
  2. Should we treat operations on incompatible types (e.g. string operations on non-strings, or on record arrays) as errors, rather than silently returning the original array?

On both of these points, I think the answer is yes. There are some Awkward functions for which being permissive is perhaps a good thing, e.g. ak.values_astype. However, in general we have type information available and I think it would be useful for users to be able to detect operations upon the "wrong" types as early as possible. Do you agree, @jpivarski?

Comment on lines +117 to +136
if layout.content.is_option:
assert isinstance(layout.content, UnmaskedArray)
return layout.copy(content=layout.content.content)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we only ever send non-missing strings to Arrow, it's fair to assume that the only outputs are non-missing as well, even if Arrow says that the type is potentially nullable.

Actually, I think that the non-nullable type is part of the information that's lost by setting extensionarray=False. But we want that because Arrow Compute only applies its string operations if it recognizes the type, and it doesn't recognize the type if it's an extension array.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me whether we're passing a nullable type, or whether Arrow is just deciding to return a nullable type.

I'm pretty sure that Arrow is just deciding to return a nullable type. Arrow's type system does not allow it to express non-nullability (the default is nullable) in the type objects themselves; non-nullability can only be expressed in fields. There's a two-level structure under each struct and Table: field (with name, nullability, and some other things), which contains type. If you have a raw list, not in a struct or Table, I don't think there's a place to put the non-nullability information.

These corner-cases are the reason we use extension arrays, to carry more information through Arrow and Parquet. But we can't use one here if we want Arrow Compute to recognize the input as strings.

@jpivarski jpivarski temporarily deployed to docs-preview August 7, 2023 18:11 — with GitHub Actions Inactive
@agoose77 agoose77 force-pushed the jpivarski/use-pyarrow-for-strings branch from c78a499 to 6e39bf1 Compare August 8, 2023 10:56
@agoose77 agoose77 temporarily deployed to docs-preview August 8, 2023 11:03 — with GitHub Actions Inactive
@agoose77 agoose77 marked this pull request as ready for review August 8, 2023 11:06
@agoose77
Copy link
Collaborator

agoose77 commented Aug 8, 2023

@jpivarski I think this is ready to go! I see the merit in being permissive for our string operations, and I think upon reflection that most of our functions are like this.

I have one outstanding API question that we need to figure out — whether to drop the ak_ prefix for string function module names? I think we should, and it suggests that the module names in ak.operations should also be changed. I don't know the motivation for the ak_ prefix; it helps with indicating that a module contains a high-level function, but that's not a strong motivation for keeping it.

@agoose77 agoose77 temporarily deployed to docs-preview August 8, 2023 11:13 — with GitHub Actions Inactive
@jpivarski
Copy link
Member Author

The ak_ prefix was to distinguish between modules and the high-level functions they contain. Some of the high-level functions have the same names as Python built-ins, so I wanted something to not spread that degeneracy into the module names as well.

You're right that ak_ makes a lot less sense for these functions, which aren't ak.* but ak.str.*. I wasn't sure what to do about that when starting this PR, but I just went with the flow. We could use a ak_str_ prefix, adding boilerplate, but it would be less confusing. If someone has a nested, indented file-browser, digging more deeply into the file structure would indent to the right because these functions are in a directory named str, but also because their names have longer prefixes.

Maybe all of the modules should be prefixed by something neutral, like m_ for module (not to be confused with "private" in C++), h_ for high-level function, or o_ for operation.

I don't think this needs to be decided now.

It's great that all of these functions are done! We should tell @martindurant that he can try them out on some log-file data whenever he gets a chance. I'll merge this (unclear who should review it) after testing against all other outstanding merges into main.

@agoose77
Copy link
Collaborator

agoose77 commented Aug 8, 2023

If we don't sort this now, well have to change the names in a breaking way down the road (only breaking for imports, which resolve file paths under the hood). That's already true for ak.operations.

I don't think we should make the ak.operations module private though - I think we want public apis to live in public modules. I'm still thinking that through, though.

@jpivarski
Copy link
Member Author

Technically true, but should people really be accessing ak.XYZ through ak.operations.XYZ or ak.operations.ak_XYZ.XYZ anyway? It seems like there should be only one way to access these functions; putting them inside directories is for our own organization. I'd be happy with operations_operations.

@martindurant
Copy link
Contributor

I suppose this means there's no longer any need for the same functionality in awkward-pandas, but I do think it would like to keep a str accessor whatever you decide here, since that's the pandas pattern.

How is the performance?

Yes, agree that nice log datasets would be perfect; https://github.com/logpai/loghub ?

@jpivarski
Copy link
Member Author

Performance should be whatever Arrow provides, which I expect to be good. (All of those functions are C++, exposed through Cython.)

It looks like the log files in that GitHub repo are not the full datasets. Some of them are supposedly 20‒30 GB, but the file itself looks like a few kB. Besides, if they really put many GB in a git repo, it would be very hard to clone.

We'd also need something to preprocess it, to get it into an Awkward format. That text processing would take some significant time.

Are any of them JSON? We already have an optimized JSON parser.

@martindurant
Copy link
Contributor

The full data files are at https://zenodo.org/record/8196385 (zip and tar.gz files). The repo contains all the data descriptions as READMEs.

@jpivarski
Copy link
Member Author

I renamed the modules in ak.operations.str as akstr_* to acknowledge that they're not in the ak.* namespace, but the ak.str.* namespace. I think this will at least be a non-confusing choice (and I thought the extra underscore between "ak" and "str" is gratuitous). Eventually, I think ak.operations should be private (L3), but that can be later and it will require a deprecation cycle.

I'm guessing this is not controversial, so I'll enable auto-merge, but you can stop it and we'll discuss if you have another idea.

@jpivarski jpivarski enabled auto-merge (squash) August 8, 2023 15:08
shortname = re.sub(r"\.operations\.str\.ak_\w+", ".str", shortname)
shortname = re.sub(r"\.operations\.str\.akstr_\w+", ".str", shortname)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@agoose77 agoose77 temporarily deployed to docs-preview August 8, 2023 15:19 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview August 8, 2023 17:54 — with GitHub Actions Inactive
@jpivarski jpivarski merged commit 1cfea2f into main Aug 8, 2023
32 checks passed
@jpivarski jpivarski deleted the jpivarski/use-pyarrow-for-strings branch August 8, 2023 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants