[FEA] Better scaling for simple regular expressions on long strings #14087

revans2 · 2023-09-12T14:22:02Z

Is your feature request related to a problem? Please describe.
In Spark we have had multiple customers that try to process really long strings with simple regular expressions. The reality is that in most cases they don't need a regular expression but the API/expression Spark exposes is for a regular expression so they use it. An example of this is string split, where what is being split on is a regular expression. They will split on things like a comma , or try to parse JSON like formatted strings by splitting on } or }} sequences. But they do this on very large strings. Strings that are over 120KiB in size. When this happens we see really bad performance on the GPU. Even worse than single threaded performance on the CPU to process the same data. Here is an example where we are essentially doing an explode(split(column_name, "}}")) on 500 rows. It is 500 rows, because the length of the strings involved end up making that about 1 row group in parquet, so this is the data that a single Spark task sees.

String Length	GPU Median Time	CPU Median Time	Hacked GPU Median Time
10	316	249	395
100	321	298	371
1,000	333	753	377
10,000	1,352	5,401	407
20,000	4,914	10,630	459
40,000	15,810	21,261	564
80,000	66,409	43,487	773
100,000	111,781	54,989	902
120,000	134,409	66,066	1,035
140,000	212,333	76,900	1,232

In this the Hacked GPU Median Time is when I hacked the Spark plugin to ignore the regular expression and instead use the non-regular expression CUDF API to split the string.

Describe the solution you'd like
In the Rapids Plugin we have put in place a number of optimizations where we will parse the regular expression and if possible transpile it to a string that we can do a non regular expression split on. We think it is worth pushing this type of optimization into CUDF itself and not just for splits. It would really be nice if CUDF could spend time to look for alternative ways to execute a regular expression, especially for really long string, that don't need a single thread per string to work properly.

Examples include

Seeing if a regular expression can be transpiled to static string and using an alternative string operation instead. This could apply to split, matches, etc.
When just checking if a regular expression matches a string we could match things like FOO.* and convert it into a starts with operation instead. This could also apply to contains or ends with.

I would like it in CUDF because I think it would benefit everyone, not just the Spark plugin, but also I think the RAPIDS team could do a better job in many cases of finding these optimizations than we are doing.

Describe alternatives you've considered
Update our own regular expression checker code to start doing more of these optimizations.

The text was updated successfully, but these errors were encountered:

GregoryKimball · 2023-09-18T19:23:12Z

Thank you @revans2 for raising this issue. It's true that often the faster regex accelerator is the one that doesn't use regex at all.

I would like it in CUDF because I think it would benefit everyone, not just the Spark plugin, but also I think the RAPIDS team could do a better job in many cases of finding these optimizations than we are doing.

We'll have to think more about the design of this. It doesn't seem like a great solution for libcudf to expose an option in the regex engine to try and avoid the regex engine. The best idea I have for now is that Regex transformation to Strings API could happen in a JIT submodule.

revans2 · 2023-09-18T19:57:30Z

It doesn't seem like a great solution for libcudf to expose an option in the regex engine to try and avoid the regex engine.

I don't see this as an option, I see it as an optimization. Why would an end user ever know or care that when they asked to split a string on the regular expression "}}" that CUDF called one kernel vs another and got the answer back much faster. We already do this kind of thing for long strings where we have separate kernels and look at the average length of the string to decide which to call. How is this any different?

revans2 added feature request New feature or request Needs Triage Need team to review and classify Performance Performance related issue Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python) labels Sep 12, 2023

revans2 mentioned this issue Sep 12, 2023

[FEA] Allow } and }} to be transpiled to static strings NVIDIA/spark-rapids#9224

Closed

GregoryKimball added 0 - Backlog In queue waiting for assignment and removed Needs Triage Need team to review and classify labels Sep 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Better scaling for simple regular expressions on long strings #14087

[FEA] Better scaling for simple regular expressions on long strings #14087

revans2 commented Sep 12, 2023

GregoryKimball commented Sep 18, 2023 •

edited

Loading

revans2 commented Sep 18, 2023

[FEA] Better scaling for simple regular expressions on long strings #14087

[FEA] Better scaling for simple regular expressions on long strings #14087

Comments

revans2 commented Sep 12, 2023

GregoryKimball commented Sep 18, 2023 • edited Loading

revans2 commented Sep 18, 2023

GregoryKimball commented Sep 18, 2023 •

edited

Loading