[FEA] Better scaling for simple regular expressions on long strings #14087
Labels
0 - Backlog
In queue waiting for assignment
feature request
New feature or request
Performance
Performance related issue
Spark
Functionality that helps Spark RAPIDS
strings
strings issues (C++ and Python)
Is your feature request related to a problem? Please describe.
In Spark we have had multiple customers who try to process really long strings with simple regular expressions. In most cases they don't actually need a regular expression, but the API/expression Spark exposes takes one, so they use it. An example of this is string split, where the delimiter is a regular expression. They will split on things like a comma `,`, or try to parse JSON-like formatted strings by splitting on `}` or `}}` sequences. But they do this on very large strings, strings that are over 120KiB in size. When this happens we see really bad performance on the GPU, even worse than single-threaded CPU performance on the same data. Here is an example where we are essentially doing an `explode(split(column_name, "}}"))` on 500 rows. It is 500 rows because the length of the strings involved makes that about one row group in Parquet, so this is the data that a single Spark task sees. In this, the `Hacked GPU Median Time` is when I hacked the Spark plugin to ignore the regular expression and instead use the non-regular-expression cuDF API to split the string.
Describe the solution you'd like
In the RAPIDS plugin we have put in place a number of optimizations where we parse the regular expression and, if possible, transpile it to a plain string that we can do a non-regular-expression split on. We think it is worth pushing this type of optimization into cuDF itself, and not just for splits. It would really be nice if cuDF could spend time looking for alternative ways to execute a regular expression, especially for really long strings, that don't need a single thread per string to work properly.
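To illustrate the kind of transpilation described above, here is a minimal Python sketch (the function name and metacharacter set are illustrative, not an actual cuDF or plugin API): if a split pattern contains no regex metacharacters, it can be rewritten as a literal split, which is what the hacked plugin did for `"}}"`.

```python
import re

# Characters that can have special meaning in a regular expression.
# A pattern containing none of these is just a literal string.
_REGEX_METACHARS = set(r"\^$.|?*+()[]{}")

def split_literal_if_possible(text, pattern):
    """Split `text` on `pattern`, using a plain literal split when the
    pattern contains no regex metacharacters (hypothetical sketch of the
    proposed optimization, not a real cuDF function)."""
    if not set(pattern) & _REGEX_METACHARS:
        # Fast path: no metacharacters, so a literal split is equivalent
        # and does not need a regex engine at all.
        return text.split(pattern)
    # Fallback: a real regular-expression split.
    return re.split(pattern, text)
```

A real implementation inside cuDF would do this analysis once per pattern and then dispatch to the existing literal-delimiter split kernel, which scales with string length instead of requiring one regex-engine thread per string.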
Examples include recognizing a pattern like `FOO.*` and converting it into a starts-with operation instead. This could also apply to contains or ends-with. I would like it in cuDF because I think it would benefit everyone, not just the Spark plugin, but also because I think the RAPIDS team could do a better job in many cases of finding these optimizations than we are.
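The `FOO.*` case above can be sketched the same way; this is a hypothetical helper (not a cuDF API) that extracts a literal prefix when the pattern is a literal followed by `.*`, so an anchored match becomes a cheap starts-with check:

```python
import re

def transpile_to_starts_with(pattern):
    """If `pattern` is a literal prefix followed by `.*`, return that
    prefix so a starts-with check can replace regex matching; return
    None when a real regex engine is needed. (Illustrative sketch.)"""
    if not pattern.endswith(".*"):
        return None
    prefix = pattern[:-2]
    # Only safe when the prefix itself has no regex metacharacters.
    if set(prefix) & set(r"\^$.|?*+()[]{}"):
        return None
    return prefix

def matches_from_start(text, pattern):
    """Anchored match, using the transpiled fast path when possible."""
    prefix = transpile_to_starts_with(pattern)
    if prefix is not None:
        return text.startswith(prefix)  # literal comparison, no regex
    return re.match(pattern, text) is not None
```

The same prefix-extraction idea generalizes to `.*FOO` for ends-with and `.*FOO.*` for contains.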
Describe alternatives you've considered
Update our own regular expression checker code to start doing more of these optimizations.