Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup and fix EscapeQuerySyntaxImpl #12973

Merged
merged 2 commits into from
Jan 8, 2024
Merged

Cleanup and fix EscapeQuerySyntaxImpl #12973

merged 2 commits into from
Jan 8, 2024

Conversation

sabi0
Copy link
Contributor

@sabi0 sabi0 commented Dec 24, 2023

No description provided.

for (int i = 0; i < count; i++) {
result.append(string.charAt(i));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to add a test for this, since you've found what looks like a bug? Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The escapeIgnoringCase method is private. It is called in three places, all looking like this:

    for (String escapableQuotedChar : escapableQuotedChars) {
      buffer = escapeIgnoringCase(buffer, escapableQuotedChar.toLowerCase(locale), "\\", locale);
    }

I.e. the input for the search string sequence1 parameter is always controlled and is never an empty string:

  private static final String[] escapableTermChars = {
    "\"", "<", ">", "=", "!", "(", ")", "^", "[", "{", ":", "]", "}", "~", "/"
  };

  private static final String[] escapableQuotedChars = {"\""};

  private static final String[] escapableWhiteChars = {" ", "\t", "\n", "\r", "\f", "\b", "\u3000"};

(unless some weird locale drops one of those characters when converting to lower case)

I wonder if this whole "empty search string" block should be replaced with an IllegalArgumentException?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not that familiar with this code but I think it'd be good to keep the cosmetic cleanups separate from functional changes - if you don't mind, I'll push this change first, then you can come up with a more focused cleanup?

@@ -184,7 +186,7 @@ public CharSequence escape(CharSequence text, Locale locale, Type type) {
* Returns a String where the escape char has been removed, or kept only once if there was a
* double escape.
*
* <p>Supports escaped unicode characters, e. g. translates <code>A</code> to <code>A</code>.
* <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the comment is trying to say unicode escape sequences are replaced into their characters? Right now it says A->A, which doesn't make sense to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably QueryParser.jj never rendered this part correctly:
https://github.com/sabi0/lucene/blob/343992fcbb4b31249f07354014723f18d0508d8a/src/java/org/apache/lucene/queryParser/QueryParser.jj#L1071

And then it got "solidified" with LUCENE-1567:
sabi0@343992f#diff-bc1f2f880b43ac551a81b97846a7e7e9119f13de581f6564a35794fa0c78ab36R212

There is another occurrence of this text in the QueryParserBase which tries to "fix" the problem:

   * <p>Supports escaped unicode characters, e. g. translates <code>\\u0041</code> to <code>A</code>

But, for instance, IntelliJ still renders that Javadoc wrongly.
Unicode escape-sequence has higher precedence than character escaping?

What do you think of fixing it like this:

   * <p>Supports escaped Unicode characters, e.g. translates \<code>u0041</code> to <code>A</code>.

Copy link
Contributor

@dweiss dweiss Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check if the newer {@code xxx} markup works correctly here? Sorry, I'm busy at the moment. Also, I would not worry about intellij - if javadoc produces valid output, intellij has a bug (feel free to report it!). Escape sequences may be translated very early by javac lexer - I'm sure there is a way to escape them though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this gives you a hint on how to escape it in the source -
https://stackoverflow.com/questions/21522770/unicode-escape-syntax-in-java

Interesting, I didn't know about it.

Copy link
Contributor Author

@sabi0 sabi0 Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the link. I did not know about the \uu... either.

Unfortunately, javadoc seems to swallow all of those 'u's anyway:

<div class="block">Returns a String where the escape char has been removed, or kept only once if there was a
 double escape.

 <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.</div>

The {@code ...} markup works the same:

<code>\u0041</code>   => A
<code>\uu0041</code>  => A
<code>\\u0041</code>  => \\u0041

{@code \u0041}   => A
{@code \uu0041}  => A
{@code \\u0041}  => \\u0041

JDK Javadoc uses Unicode escape for the backslash itself: {@code \u005Cu0800}:
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/io/DataInput.java#L116

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for investigating. I think javac and javadoc should be consistent here - if they're not, it's worth firing a message to openjdk...

@dweiss dweiss merged commit 0fc1e2c into apache:main Jan 8, 2024
4 checks passed
@dweiss dweiss added this to the 10.0.0 milestone Jan 8, 2024
@dweiss dweiss self-assigned this Jan 8, 2024
stefanvodita pushed a commit to stefanvodita/lucene that referenced this pull request Jan 9, 2024
@sabi0 sabi0 deleted the EscapeQuerySyntaxImpl branch January 9, 2024 08:50
slow-J pushed a commit to slow-J/lucene that referenced this pull request Jan 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants