-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup and fix EscapeQuerySyntaxImpl #12973
Conversation
for (int i = 0; i < count; i++) { | ||
result.append(string.charAt(i)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to add a test for this, since you've found what looks like a bug? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The escapeIgnoringCase
method is private
. It is called in three places, all looking like this:
for (String escapableQuotedChar : escapableQuotedChars) {
buffer = escapeIgnoringCase(buffer, escapableQuotedChar.toLowerCase(locale), "\\", locale);
}
I.e. the input for the search string sequence1
parameter is always controlled and is never an empty string:
private static final String[] escapableTermChars = {
"\"", "<", ">", "=", "!", "(", ")", "^", "[", "{", ":", "]", "}", "~", "/"
};
private static final String[] escapableQuotedChars = {"\""};
private static final String[] escapableWhiteChars = {" ", "\t", "\n", "\r", "\f", "\b", "\u3000"};
(unless some weird locale drops one of those characters when converting to lower case)
I wonder if this whole "empty search string" block should be replaced with an IllegalArgumentException
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not that familiar with this code but I think it'd be good to keep the cosmetic cleanups separate from functional changes - if you don't mind, I'll push this change first, then you can come up with a more focused cleanup?
@@ -184,7 +186,7 @@ public CharSequence escape(CharSequence text, Locale locale, Type type) { | |||
* Returns a String where the escape char has been removed, or kept only once if there was a | |||
* double escape. | |||
* | |||
* <p>Supports escaped unicode characters, e. g. translates <code>A</code> to <code>A</code>. | |||
* <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the comment is trying to say unicode escape sequences are replaced into their characters? Right now it says A->A, which doesn't make sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably QueryParser.jj
never rendered this part correctly:
https://github.com/sabi0/lucene/blob/343992fcbb4b31249f07354014723f18d0508d8a/src/java/org/apache/lucene/queryParser/QueryParser.jj#L1071
And then it got "solidified" with LUCENE-1567:
sabi0@343992f#diff-bc1f2f880b43ac551a81b97846a7e7e9119f13de581f6564a35794fa0c78ab36R212
There is another occurrence of this text in the QueryParserBase
which tries to "fix" the problem:
* <p>Supports escaped unicode characters, e. g. translates <code>\\u0041</code> to <code>A</code>
But, for instance, IntelliJ still renders that Javadoc wrongly.
Unicode escape-sequence has higher precedence than character escaping?
What do you think of fixing it like this:
* <p>Supports escaped Unicode characters, e.g. translates \<code>u0041</code> to <code>A</code>.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check if the newer {@code xxx}
markup works correctly here? Sorry, I'm busy at the moment. Also, I would not worry about intellij - if javadoc produces valid output, intellij has a bug (feel free to report it!). Escape sequences may be translated very early by javac lexer - I'm sure there is a way to escape them though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this gives you a hint on how to escape it in the source -
https://stackoverflow.com/questions/21522770/unicode-escape-syntax-in-java
Interesting, I didn't know about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the link. I did not know about the \uu...
either.
Unfortunately, javadoc seems to swallow all of those 'u's anyway:
<div class="block">Returns a String where the escape char has been removed, or kept only once if there was a
double escape.
<p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.</div>
The {@code ...}
markup works the same:
<code>\u0041</code> => A
<code>\uu0041</code> => A
<code>\\u0041</code> => \\u0041
{@code \u0041} => A
{@code \uu0041} => A
{@code \\u0041} => \\u0041
JDK Javadoc uses Unicode escape for the backslash itself: {@code \u005Cu0800}
:
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/io/DataInput.java#L116
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for investigating. I think javac and javadoc should be consistent here - if they're not, it's worth firing a message to openjdk...
No description provided.