Cleanup and fix EscapeQuerySyntaxImpl #12973

sabi0 · 2023-12-24T22:30:48Z

No description provided.

dweiss · 2023-12-28T18:46:38Z

...r/src/java/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java

      for (int i = 0; i < count; i++) {
-        result.append(string.charAt(i));


Would it be possible to add a test for this, since you've found what looks like a bug? Thanks!

The escapeIgnoringCase method is private. It is called in three places, all looking like this:

    for (String escapableQuotedChar : escapableQuotedChars) {
      buffer = escapeIgnoringCase(buffer, escapableQuotedChar.toLowerCase(locale), "\\", locale);
    }

I.e. the search-string argument (the sequence1 parameter) is always controlled and is never an empty string:

    private static final String[] escapableTermChars = {
      "\"", "<", ">", "=", "!", "(", ")", "^", "[", "{", ":", "]", "}", "~", "/"
    };
    private static final String[] escapableQuotedChars = {"\""};
    private static final String[] escapableWhiteChars = {" ", "\t", "\n", "\r", "\f", "\b", "\u3000"};

(unless some weird locale drops one of those characters when converting to lower case)
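The locale caveat can be checked empirically. Below is a minimal sketch, assuming the three arrays quoted in this thread; the class name LowerCaseCheck and the helper neverEmptyAfterLowerCase are hypothetical and not part of Lucene. It lower-cases every escapable string in every locale the running JDK knows about and reports any that becomes empty.

```java
import java.util.Locale;

public class LowerCaseCheck {
  // Copied from the arrays quoted in this thread.
  static final String[] ESCAPABLE_TERM_CHARS = {
    "\"", "<", ">", "=", "!", "(", ")", "^", "[", "{", ":", "]", "}", "~", "/"
  };
  static final String[] ESCAPABLE_QUOTED_CHARS = {"\""};
  static final String[] ESCAPABLE_WHITE_CHARS = {" ", "\t", "\n", "\r", "\f", "\b", "\u3000"};

  // Hypothetical helper: true if no escapable string becomes empty
  // when lower-cased in the given locale.
  static boolean neverEmptyAfterLowerCase(Locale locale) {
    String[][] groups = {ESCAPABLE_TERM_CHARS, ESCAPABLE_QUOTED_CHARS, ESCAPABLE_WHITE_CHARS};
    for (String[] group : groups) {
      for (String s : group) {
        if (s.toLowerCase(locale).isEmpty()) {
          return false;
        }
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Only covers the locales installed in the running JDK.
    for (Locale locale : Locale.getAvailableLocales()) {
      if (!neverEmptyAfterLowerCase(locale)) {
        System.out.println("Empty search string after lower-casing in " + locale);
      }
    }
  }
}
```

This only covers the locales installed in the running JDK, so it is a sanity check rather than a proof.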

I wonder if this whole "empty search string" block should be replaced with an IllegalArgumentException?
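To make the suggestion concrete, here is a rough sketch of what rejecting an empty search string could look like. It is a simplified, hypothetical stand-in and not the actual Lucene method: the class name EscapeSketch is invented, the locale parameter is dropped, and matching uses String.regionMatches with ignoreCase instead of locale-aware lower-casing.

```java
public class EscapeSketch {
  // Hypothetical, simplified stand-in for the private helper discussed here.
  // Prepends escapeChar to every case-insensitive occurrence of search in text.
  static String escapeIgnoringCase(String text, String search, String escapeChar) {
    if (search.isEmpty()) {
      // The proposed behavior: fail fast instead of handling the empty case.
      throw new IllegalArgumentException("search string must not be empty");
    }
    StringBuilder result = new StringBuilder(text.length());
    int i = 0;
    while (i < text.length()) {
      if (text.regionMatches(true, i, search, 0, search.length())) {
        // Keep the original casing of the matched text, but escape it.
        result.append(escapeChar).append(text, i, i + search.length());
        i += search.length();
      } else {
        result.append(text.charAt(i));
        i++;
      }
    }
    return result.toString();
  }

  public static void main(String[] args) {
    System.out.println(escapeIgnoringCase("say \"hi\"", "\"", "\\"));
  }
}
```

With a guard like this, a caller that somehow ended up with an empty search string would get an immediate exception rather than silently altered output.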

I am not that familiar with this code but I think it'd be good to keep the cosmetic cleanups separate from functional changes - if you don't mind, I'll push this change first, then you can come up with a more focused cleanup?

dweiss · 2023-12-28T18:47:36Z

...r/src/java/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java

@@ -184,7 +186,7 @@ public CharSequence escape(CharSequence text, Locale locale, Type type) {
   * Returns a String where the escape char has been removed, or kept only once if there was a
   * double escape.
   *
-   * <p>Supports escaped unicode characters, e. g. translates <code>A</code> to <code>A</code>.
+   * <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.


Seems like the comment is trying to say that Unicode escape sequences are replaced with their characters? Right now it says A->A, which doesn't make sense to me.

Probably QueryParser.jj never rendered this part correctly:
https://github.com/sabi0/lucene/blob/343992fcbb4b31249f07354014723f18d0508d8a/src/java/org/apache/lucene/queryParser/QueryParser.jj#L1071

And then it got "solidified" with LUCENE-1567:
sabi0@343992f#diff-bc1f2f880b43ac551a81b97846a7e7e9119f13de581f6564a35794fa0c78ab36R212

There is another occurrence of this text in the QueryParserBase which tries to "fix" the problem:

* <p>Supports escaped unicode characters, e. g. translates <code>\\u0041</code> to <code>A</code>

But, for instance, IntelliJ still renders that Javadoc wrongly.
Does a Unicode escape sequence have higher precedence than character escaping?

What do you think of fixing it like this:

* <p>Supports escaped Unicode characters, e.g. translates \<code>u0041</code> to <code>A</code>.

Can you check if the newer {@code xxx} markup works correctly here? Sorry, I'm busy at the moment. Also, I would not worry about IntelliJ - if javadoc produces valid output, IntelliJ has a bug (feel free to report it!). Escape sequences may be translated very early by the javac lexer - I'm sure there is a way to escape them though.

I think this gives you a hint on how to escape it in the source -
https://stackoverflow.com/questions/21522770/unicode-escape-syntax-in-java

Interesting, I didn't know about it.

Thank you for the link. I did not know about the \uu... either.

Unfortunately, javadoc seems to swallow all of those 'u's anyway:

<div class="block">Returns a String where the escape char has been removed, or kept only once if there was a double escape. <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.</div>

The {@code ...} markup works the same:

<code>\u0041</code> => A
<code>\uu0041</code> => A
<code>\\u0041</code> => \\u0041
{@code \u0041} => A
{@code \uu0041} => A
{@code \\u0041} => \\u0041

JDK Javadoc uses Unicode escape for the backslash itself: {@code \u005Cu0800}:
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/io/DataInput.java#L116
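For comparison, here is a small sketch of how javac itself treats the same three spellings inside string literals. It illustrates the compiler's early Unicode-escape translation only, not javadoc's rendering; the class name UnicodeEscapeDemo is made up for this example.

```java
public class UnicodeEscapeDemo {
  // Translated by javac before tokenizing, so this literal is just "A".
  static final String TRANSLATED = "\u0041";

  // Extra 'u' characters are allowed in a Unicode escape; this is also "A".
  static final String MULTI_U = "\uu0041";

  // The second backslash follows an odd number of backslashes, so no Unicode
  // escape starts here; the value is a backslash followed by "u0041".
  static final String ESCAPED = "\\u0041";

  public static void main(String[] args) {
    System.out.println(TRANSLATED + " length=" + TRANSLATED.length());
    System.out.println(MULTI_U + " length=" + MULTI_U.length());
    System.out.println(ESCAPED + " length=" + ESCAPED.length());
  }
}
```

Note that the backslash-as-Unicode-escape form used in the JDK Javadoc is only usable in comments: inside a string literal it would leave a backslash followed by u, which is not a valid string escape sequence and fails to compile.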

Thank you for investigating. I think javac and javadoc should be consistent here - if they're not, it's worth firing a message to openjdk...

dweiss reviewed Dec 28, 2023


sabi0 added 2 commits January 8, 2024 11:07

Cleanup and fix EscapeQuerySyntaxImpl

99fdcf1

Fix Unicode escaping in javadoc

e4ab0a3

dweiss approved these changes Jan 8, 2024


dweiss merged commit 0fc1e2c into apache:main Jan 8, 2024
4 checks passed

dweiss added this to the 10.0.0 milestone Jan 8, 2024

dweiss self-assigned this Jan 8, 2024

stefanvodita pushed a commit to stefanvodita/lucene that referenced this pull request Jan 9, 2024

Code cleanups in EscapeQuerySyntaxImpl (apache#12973)

6deb9ba

sabi0 deleted the EscapeQuerySyntaxImpl branch January 9, 2024 08:50

slow-J pushed a commit to slow-J/lucene that referenced this pull request Jan 16, 2024

Code cleanups in EscapeQuerySyntaxImpl (apache#12973)

6756d12
