Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect LIKE and ILIKE result for NULL input, `%', and '%%' pattern #12637

Open
goldmedal opened this issue Sep 26, 2024 · 13 comments
Open

Incorrect LIKE and ILIKE result for NULL input, `%', and '%%' pattern #12637

goldmedal opened this issue Sep 26, 2024 · 13 comments
Assignees
Labels
bug Something isn't working waiting-on-upstream PR is waiting on an upstream dependency to be updated

Comments

@goldmedal
Copy link
Contributor

goldmedal commented Sep 26, 2024

Describe the bug

While working on #12415, I found the LIKE and ILIKE behavior differs between StringView and other string types. Given the following data and SQL:

DataFusion CLI v42.0.0
> create table test_source as values
  ('Andrew', 'X', 'datafusion📊🔥', '🔥'),
  ('Xiangpeng', 'Xiangpeng', 'datafusion数据融合', 'datafusion数据融合'),
  ('Raphael', 'R', 'datafusionДатаФусион', 'аФус'),
  (NULL, 'R', NULL, '🔥');
0 row(s) fetched. 
Elapsed 0.028 seconds.

> create table test_basic_operator_utf8view as
select
    arrow_cast(column1, 'Utf8View') as ascii_1,
    arrow_cast(column2, 'Utf8View') as ascii_2,
    arrow_cast(column3, 'Utf8View') as unicode_1,
    arrow_cast(column4, 'Utf8View') as unicode_2
from test_source;
0 row(s) fetched. 
Elapsed 0.002 seconds.

> create table test_basic_operator_utf8 as
select
    arrow_cast(column1, 'Utf8') as ascii_1,
    arrow_cast(column2, 'Utf8') as ascii_2,
    arrow_cast(column3, 'Utf8') as unicode_1,
    arrow_cast(column4, 'Utf8') as unicode_2
from test_source;
0 row(s) fetched. 
Elapsed 0.004 seconds.

-- StringView Table ( pattern contain % )
> select ascii_1, unicode_1,
       ascii_1 like 'An%' as ascii_like,
       unicode_1 like '%ion数据%' as unicode_like,
       ascii_1 ilike 'An%' as ascii_ilike,
       unicode_1 ilike '%ion数据%' as unicode_ilik
from test_basic_operator_utf8view;
+-----------+----------------------+------------+--------------+-------------+--------------+
| ascii_1   | unicode_1            | ascii_like | unicode_like | ascii_ilike | unicode_ilik |
+-----------+----------------------+------------+--------------+-------------+--------------+
| Andrew    | datafusion📊🔥       | true       | false        | true        | false        |
| Xiangpeng | datafusion数据融合   | false      | true         | false       | true         |
| Raphael   | datafusionДатаФусион | false      | false        | false       | false        |
|           |                      | false      | false        | false       |              |
+-----------+----------------------+------------+--------------+-------------+--------------+
4 row(s) fetched. 
Elapsed 0.004 seconds.

-- StringView Table ( pattern without % )
> select ascii_1, unicode_1,
       ascii_1 like 'An' as ascii_like,
       unicode_1 like '%ion数据' as unicode_like,
       ascii_1 ilike 'An' as ascii_ilike,
       unicode_1 ilike 'ion数据' as unicode_ilik
from test_basic_operator_utf8view;
+-----------+----------------------+------------+--------------+-------------+--------------+
| ascii_1   | unicode_1            | ascii_like | unicode_like | ascii_ilike | unicode_ilik |
+-----------+----------------------+------------+--------------+-------------+--------------+
| Andrew    | datafusion📊🔥       | false      | false        | false       | false        |
| Xiangpeng | datafusion数据融合   | false      | false        | false       | false        |
| Raphael   | datafusionДатаФусион | false      | false        | false       | false        |
|           |                      |            | false        |             |              |
+-----------+----------------------+------------+--------------+-------------+--------------+
4 row(s) fetched. 
Elapsed 0.004 seconds.

-- String Table (same as LargeString and DictionaryString)
> select ascii_1, unicode_1,
       ascii_1 like 'An%' as ascii_like,
       unicode_1 like '%ion数据%' as unicode_like,
       ascii_1 ilike 'An%' as ascii_ilike,
       unicode_1 ilike '%ion数据%' as unicode_ilik
from test_basic_operator_utf8;
+-----------+----------------------+------------+--------------+-------------+--------------+
| ascii_1   | unicode_1            | ascii_like | unicode_like | ascii_ilike | unicode_ilik |
+-----------+----------------------+------------+--------------+-------------+--------------+
| Andrew    | datafusion📊🔥       | true       | false        | true        | false        |
| Xiangpeng | datafusion数据融合   | false      | true         | false       | true         |
| Raphael   | datafusionДатаФусион | false      | false        | false       | false        |
|           |                      |            |              |             |              |
+-----------+----------------------+------------+--------------+-------------+--------------+
4 row(s) fetched. 
Elapsed 0.004 seconds.

When the input value is NULL, string type will return NULL but string view will return false. (Something is interesting about the ILIKE operation is different between ASCII-only and Unicode StringView 🤔 )

Some testing for StringView ScalarValue

When the matching pattern contains %, it will return false instead null.

> select
    arrow_cast(null, 'Utf8View') like '123' as utf8view_like_ascii,
    arrow_cast(null, 'Utf8View') ilike '123' as utf8view_ilike_ascii,
    arrow_cast(null, 'Utf8View') like '123%' as "utf8view_like_ascii%",
    arrow_cast(null, 'Utf8View') ilike '123%' as "utf8view_ilike_ascii%";
+---------------------+----------------------+----------------------+-----------------------+
| utf8view_like_ascii | utf8view_ilike_ascii | utf8view_like_ascii% | utf8view_ilike_ascii% |
+---------------------+----------------------+----------------------+-----------------------+
|                     |                      | false                | false                 |
+---------------------+----------------------+----------------------+-----------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

> select    
    arrow_cast(null, 'Utf8View') like '數據' as "utf8view_like_unicode",
    arrow_cast(null, 'Utf8View') ilike '數據' as "utf8view_ilike_unicode",
    arrow_cast(null, 'Utf8View') like '數據%' as "utf8view_like_unicode%",
    arrow_cast(null, 'Utf8View') ilike '數據%' as "utf8view_ilike_unicode%";
+-----------------------+------------------------+------------------------+-------------------------+
| utf8view_like_unicode | utf8view_ilike_unicode | utf8view_like_unicode% | utf8view_ilike_unicode% |
+-----------------------+------------------------+------------------------+-------------------------+
|                       |                        | false                  |                         |
+-----------------------+------------------------+------------------------+-------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

Some testing for String ScalarValue (Same as LargeString and DictionaryString)

> select
    arrow_cast(null, 'Utf8') like '123' as utf8_like_ascii,
    arrow_cast(null, 'Utf8') ilike '123' as utf8_ilike_ascii,
    arrow_cast(null, 'Utf8') like '123%' as "utf8_like_ascii%",
    arrow_cast(null, 'Utf8') ilike '123%' as "utf8_ilike_ascii%";
+-----------------+------------------+------------------+-------------------+
| utf8_like_ascii | utf8_ilike_ascii | utf8_like_ascii% | utf8_ilike_ascii% |
+-----------------+------------------+------------------+-------------------+
|                 |                  |                  |                   |
+-----------------+------------------+------------------+-------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

> select    
    arrow_cast(null, 'Utf8') like '數據' as "utf8_like_unicode",
    arrow_cast(null, 'Utf8') ilike '數據' as "utf8_ilike_unicode",
    arrow_cast(null, 'Utf8') like '數據%' as "utf8_like_unicode%",
    arrow_cast(null, 'Utf8') ilike '數據%' as "utf8_ilike_unicode%";
+-------------------+--------------------+--------------------+---------------------+
| utf8_like_unicode | utf8_ilike_unicode | utf8_like_unicode% | utf8_ilike_unicode% |
+-------------------+--------------------+--------------------+---------------------+
|                   |                    |                    |                     |
+-------------------+--------------------+--------------------+---------------------+
1 row(s) fetched. 
Elapsed 0.002 seconds.

To Reproduce

Run the SQLs mentioned above.

Expected behavior

I'm not really sure if the behavior of StringView is expected 🤔 but I think their behavior should be consistent.
If the input is null, the like and ilike should return null.

Additional context

No response

@goldmedal goldmedal added the bug Something isn't working label Sep 26, 2024
@alamb
Copy link
Contributor

alamb commented Sep 26, 2024

I'm not really sure if the behavior of StringView is expected 🤔 but I think their behavior should be consistent.

When the input value is NULL, string type will return NULL but string view will return false. (Something is interesting about the ILIKE operation is different between ASCII-only and Unicode StringView 🤔 )

I think if the input is null, the string view should also return null

@goldmedal
Copy link
Contributor Author

I think if the input is null, the string view should also return null

Thanks! I updated the expected behavior 👍

@findepi
Copy link
Member

findepi commented Oct 31, 2024

Concise repro. This should return NULL

> select arrow_cast(null, 'Utf8') LIKE '%';
+----------------------------------------------+
| arrow_cast(NULL,Utf8("Utf8")) LIKE Utf8("%") |
+----------------------------------------------+
| true                                         |
+----------------------------------------------+

@goldmedal can you please change issue title to something like Incorrect LIKE and ILIKE result for NULL input and `%` pattern?

@goldmedal goldmedal changed the title The LIKE and ILIKE behavior for NULL handling in StringView differs from other string types Incorrect LIKE and ILIKE result for NULL input and % pattern Oct 31, 2024
@goldmedal
Copy link
Contributor Author

@goldmedal can you please change issue title to something like Incorrect LIKE and ILIKE result for NULL input and `%` pattern?

Thanks for the suggestion 👍

@findepi
Copy link
Member

findepi commented Oct 31, 2024

LIKE -- apache/arrow-rs#6662

@findepi
Copy link
Member

findepi commented Nov 4, 2024

take

@findepi
Copy link
Member

findepi commented Nov 4, 2024

I plan to take care of ILIKE as well, once the above is accepted.

findepi added a commit to findepi/datafusion that referenced this issue Nov 5, 2024
`expr LIKE '%'` was previously simplified to `true`, but the expression
returns `NULL` when `expr` is null.
The conversion was conditional on `!is_null(expr)` which means "is not
always true, i.e. is not a null literal".

This commit adds correct simplification logic. It additionally expands
the rule coverage to include string view (Utf8View) and large string
(LargeUtf8). This allows writing shared test cases even despite
`utf8_view LIKE '%'` returning incorrect results at execution time
(tracked by apache#12637). I.e. the
simplification masks the bug for cases where pattern is statically
known.
@findepi
Copy link
Member

findepi commented Nov 5, 2024

For cases when pattern is static (majority), we can partially fix this issue by fixing a bug in the simplifier, and thus avoiding buggy execution path: #13259

findepi added a commit to findepi/datafusion that referenced this issue Nov 5, 2024
`expr LIKE '%'` was previously simplified to `true`, but the expression
returns `NULL` when `expr` is null.
The conversion was conditional on `!is_null(expr)` which means "is not
always true, i.e. is not a null literal".

This commit adds correct simplification logic. It additionally expands
the rule coverage to include string view (Utf8View) and large string
(LargeUtf8). This allows writing shared test cases even despite
`utf8_view LIKE '%'` returning incorrect results at execution time
(tracked by apache#12637). I.e. the
simplification masks the bug for cases where pattern is statically
known.
findepi added a commit to findepi/datafusion that referenced this issue Nov 5, 2024
`expr LIKE '%'` was previously simplified to `true`, but the expression
returns `NULL` when `expr` is null.
The conversion was conditional on `!is_null(expr)` which means "is not
always true, i.e. is not a null literal".

This commit adds correct simplification logic. It additionally expands
the rule coverage to include string view (Utf8View) and large string
(LargeUtf8). This allows writing shared test cases even despite
`utf8_view LIKE '%'` returning incorrect results at execution time
(tracked by apache#12637). I.e. the
simplification masks the bug for cases where pattern is statically
known.
@findepi
Copy link
Member

findepi commented Nov 5, 2024

@goldmedal can you maybe edit issue title once again to cover %% patterns too?
% static patterns are being fixed in #13259
%% test cases in #13259 are commented out pointing at this issue

(i will see maybe i can fix repeated percent there too?)

@goldmedal goldmedal changed the title Incorrect LIKE and ILIKE result for NULL input and % pattern Incorrect LIKE and ILIKE result for NULL input, `%', and '%%' pattern Nov 5, 2024
@goldmedal
Copy link
Contributor Author

Interesting, I haven't checked the source code but I tried different numbers of the % char. Only one and two will be wrong.

> select arrow_cast(null, 'Utf8View') like '%';
+--------------------------------------------------+
| arrow_cast(NULL,Utf8("Utf8View")) LIKE Utf8("%") |
+--------------------------------------------------+
| true                                             |
+--------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

> select arrow_cast(null, 'Utf8View') like '%%';
+---------------------------------------------------+
| arrow_cast(NULL,Utf8("Utf8View")) LIKE Utf8("%%") |
+---------------------------------------------------+
| true                                              |
+---------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

> select arrow_cast(null, 'Utf8View') like '%%%';
+----------------------------------------------------+
| arrow_cast(NULL,Utf8("Utf8View")) LIKE Utf8("%%%") |
+----------------------------------------------------+
|                                                    |
+----------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

> select arrow_cast(null, 'Utf8View') like '%%%%';
+-----------------------------------------------------+
| arrow_cast(NULL,Utf8("Utf8View")) LIKE Utf8("%%%%") |
+-----------------------------------------------------+
|                                                     |
+-----------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

It's a nice find 👍

findepi added a commit to findepi/datafusion that referenced this issue Nov 5, 2024
`expr LIKE '%'` was previously simplified to `true`, but the expression
returns `NULL` when `expr` is null.
The conversion was conditional on `!is_null(expr)` which means "is not
always true, i.e. is not a null literal".

This commit adds correct simplification logic. It additionally expands
the rule coverage to include string view (Utf8View) and large string
(LargeUtf8). This allows writing shared test cases even despite
`utf8_view LIKE '%'` returning incorrect results at execution time
(tracked by apache#12637). I.e. the
simplification masks the bug for cases where pattern is statically
known.
@findepi
Copy link
Member

findepi commented Nov 5, 2024

It looks like a missed optimization opportunity. %, %%, %...%% patterns should all behave the same.
in #13260 i am adding a simplifier to make them the same.

goldmedal pushed a commit that referenced this issue Nov 6, 2024
* Fix incorrect `... LIKE '%'` simplification

`expr LIKE '%'` was previously simplified to `true`, but the expression
returns `NULL` when `expr` is null.
The conversion was conditional on `!is_null(expr)` which means "is not
always true, i.e. is not a null literal".

This commit adds correct simplification logic. It additionally expands
the rule coverage to include string view (Utf8View) and large string
(LargeUtf8). This allows writing shared test cases even despite
`utf8_view LIKE '%'` returning incorrect results at execution time
(tracked by #12637). I.e. the
simplification masks the bug for cases where pattern is statically
known.

* fixup! Fix incorrect `... LIKE '%'` simplification

* fix tests (re review comments)
findepi added a commit to sdf-labs/arrow-datafusion that referenced this issue Nov 6, 2024
* Fix incorrect `... LIKE '%'` simplification

`expr LIKE '%'` was previously simplified to `true`, but the expression
returns `NULL` when `expr` is null.
The conversion was conditional on `!is_null(expr)` which means "is not
always true, i.e. is not a null literal".

This commit adds correct simplification logic. It additionally expands
the rule coverage to include string view (Utf8View) and large string
(LargeUtf8). This allows writing shared test cases even despite
`utf8_view LIKE '%'` returning incorrect results at execution time
(tracked by apache#12637). I.e. the
simplification masks the bug for cases where pattern is statically
known.

* fixup! Fix incorrect `... LIKE '%'` simplification

* fix tests (re review comments)
findepi added a commit to sdf-labs/arrow-datafusion that referenced this issue Nov 6, 2024
…che#13259)

* Fix incorrect `... LIKE '%'` simplification

`expr LIKE '%'` was previously simplified to `true`, but the expression
returns `NULL` when `expr` is null.
The conversion was conditional on `!is_null(expr)` which means "is not
always true, i.e. is not a null literal".

This commit adds correct simplification logic. It additionally expands
the rule coverage to include string view (Utf8View) and large string
(LargeUtf8). This allows writing shared test cases even despite
`utf8_view LIKE '%'` returning incorrect results at execution time
(tracked by apache#12637). I.e. the
simplification masks the bug for cases where pattern is statically
known.

* fixup! Fix incorrect `... LIKE '%'` simplification

* fix tests (re review comments)
findepi added a commit to sdf-labs/arrow-datafusion that referenced this issue Nov 6, 2024
…che#13259)

* Fix incorrect `... LIKE '%'` simplification

`expr LIKE '%'` was previously simplified to `true`, but the expression
returns `NULL` when `expr` is null.
The conversion was conditional on `!is_null(expr)` which means "is not
always true, i.e. is not a null literal".

This commit adds correct simplification logic. It additionally expands
the rule coverage to include string view (Utf8View) and large string
(LargeUtf8). This allows writing shared test cases even despite
`utf8_view LIKE '%'` returning incorrect results at execution time
(tracked by apache#12637). I.e. the
simplification masks the bug for cases where pattern is statically
known.

* fixup! Fix incorrect `... LIKE '%'` simplification

* fix tests (re review comments)
findepi added a commit to sdf-labs/arrow-datafusion that referenced this issue Nov 6, 2024
…che#13259)

* Fix incorrect `... LIKE '%'` simplification

`expr LIKE '%'` was previously simplified to `true`, but the expression
returns `NULL` when `expr` is null.
The conversion was conditional on `!is_null(expr)` which means "is not
always true, i.e. is not a null literal".

This commit adds correct simplification logic. It additionally expands
the rule coverage to include string view (Utf8View) and large string
(LargeUtf8). This allows writing shared test cases even despite
`utf8_view LIKE '%'` returning incorrect results at execution time
(tracked by apache#12637). I.e. the
simplification masks the bug for cases where pattern is statically
known.

* fixup! Fix incorrect `... LIKE '%'` simplification

* fix tests (re review comments)
@findepi
Copy link
Member

findepi commented Nov 11, 2024

I plan to take care of ILIKE as well, once the above is accepted.

done in apache/arrow-rs#6705

should we close this issue when we update arrow-rs dep?

@alamb alamb added the waiting-on-upstream PR is waiting on an upstream dependency to be updated label Nov 13, 2024
@alamb
Copy link
Contributor

alamb commented Nov 13, 2024

Marked as waiting on upstream

findepi added a commit to sdf-labs/arrow-datafusion that referenced this issue Nov 14, 2024
* Fix incorrect `... LIKE '%'` simplification

`expr LIKE '%'` was previously simplified to `true`, but the expression
returns `NULL` when `expr` is null.
The conversion was conditional on `!is_null(expr)` which means "is not
always true, i.e. is not a null literal".

This commit adds correct simplification logic. It additionally expands
the rule coverage to include string view (Utf8View) and large string
(LargeUtf8). This allows writing shared test cases even despite
`utf8_view LIKE '%'` returning incorrect results at execution time
(tracked by apache#12637). I.e. the
simplification masks the bug for cases where pattern is statically
known.

* fixup! Fix incorrect `... LIKE '%'` simplification

* fix tests (re review comments)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working waiting-on-upstream PR is waiting on an upstream dependency to be updated
Projects
None yet
Development

No branches or pull requests

3 participants