[BUG] regexp and wildcard query giving false negatives #12500
Comments
So, this comes down to the term-based matching that Lucene (the underlying search library) does. Your input text gets processed into a stream of tokens, typically one token per word. A token is a "term" (the string content), plus some other attributes, like the position in the stream. In this case, you can match on
This is equivalent to:
If order does matter and you want the words to occur within some distance (like in your original regexp query), you can use a
The regexp and wildcard queries are there to match regexps and wildcards against individual terms -- not the whole text. So, e.g. you could match on
Sorry, I left out the reasoning behind my question. I have many regular expressions that I want to be able to use as-is and not have to rewrite them as a different type of query. This is only one example. Is this something Lucene can handle, or would I have to rewrite them? EDIT: Didn't read your last paragraph. Okay, regexp and wildcard only work on terms, not whole text. Is that documented somewhere?
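To make the term-level behaviour discussed above concrete, here is a minimal Python sketch. It is not how Lucene or OpenSearch is implemented: `tokenize` is a rough stand-in for an analyzer, Python's `re` syntax differs from Lucene's regexp syntax, and the sample text is hypothetical. It only shows the idea that a pattern has to match one whole term, so a pattern spanning several words finds nothing.

```python
import re

def tokenize(text):
    # Assumption: a crude stand-in for an analyzer that lowercases
    # the input and splits it on anything that is not a letter or digit.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def term_regexp_match(text, pattern):
    # The pattern is tested against each term separately and must
    # match the entire term, never the original text as a whole.
    compiled = re.compile(pattern)
    return any(compiled.fullmatch(term) for term in tokenize(text))

doc = "The quick brown fox"  # hypothetical sample text

single_term = term_regexp_match(doc, r"qu.*")        # can match the term "quick"
multi_term = term_regexp_match(doc, r"quick.*fox")   # no single term contains both words
print(single_term, multi_term)
```

Under these assumptions the multi-word pattern returns no match even though the words are present in the text, which is the "false negative" reported in this issue.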
I edited my previous response to add that last paragraph just as you were posting your question, but I apparently anticipated your question to some extent 😁 I suppose you could explicitly mark your field as type Ideally, we should add support for a Wildcard field type, like Elasticsearch did. We have an open issue for that at #5639. They have a really good blog post explaining the tradeoffs between the
Thanks for this overview of better ways to do this. The concern I had is that if I did just write a regexp query, not knowing how to write the better ones, why do no results come back? Are regexp queries known to be buggy, or just slow? In my world, having a query silently come back with false negatives is a big problem. Could something like a warning/error be raised if the regular expression doesn't work correctly in a regexp query? The reason I ask all of this is because my company has a hardware product that uses regular expressions for queries and we're trying to figure out what kind of integration with OpenSearch makes sense. But it's looking like people don't want to write regexp in OpenSearch anyway.
Regexp is a valid use-case, but right now OpenSearch is still built for the classic "tokenized text" behavior. We really should address #5639 some time. The approach used by Elasticsearch (described in that blog post I linked above) sounds like it does a great job of prefiltering based on trigrams (using Lucene's great conjunctive matching) then doing the expensive wildcard/regexp evaluation on the filtered docs. We should implement something similar. I would suggest speaking up on #5639 to highlight your use-case as another +1 for that issue.
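As a rough illustration of the "prefilter on trigrams, then verify" idea mentioned above, here is a small Python sketch. It is not Elasticsearch's or Lucene's implementation. The `TrigramIndex` class and its methods are hypothetical names, the sample documents are made up, and the caller supplies the literal substrings the pattern requires, because deriving them from an arbitrary regexp is the hard part and is left out here.

```python
import re
from collections import defaultdict

def trigrams(s):
    # All 3-character substrings of s (empty set if s is shorter than 3).
    return {s[i:i + 3] for i in range(len(s) - 2)}

class TrigramIndex:
    """Hypothetical toy index: trigram -> set of doc ids, plus the raw text."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def search(self, required_literals, pattern):
        # Step 1: cheap conjunctive prefilter. A document survives only
        # if it contains every trigram of every required literal.
        candidates = set(self.docs)
        for literal in required_literals:
            for gram in trigrams(literal):
                candidates &= self.postings.get(gram, set())
        # Step 2: expensive check. Run the real regexp against the full
        # original text, but only for the surviving candidates.
        rx = re.compile(pattern)
        return sorted(d for d in candidates if rx.search(self.docs[d]))

index = TrigramIndex()
index.add(1, "the quick brown fox")
index.add(2, "a quick red fox")
index.add(3, "slow brown bear")

hits = index.search(["quick brown"], r"quick brown.*fox")
print(hits)
```

The prefilter can let through documents that do not actually match (the trigrams may occur in a different order or place), which is why the second step still evaluates the full pattern on the original text.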
Thanks for your input. This has been helpful.
Describe the bug
I have some text I know is in one of my documents:
Related component
Search
To Reproduce
But doing a regexp query does not give any results:
Nor a wildcard query:
Expected behavior
An intervals query works:
(The score is an order of magnitude higher for the correct document)
Additional Details
Here is what I have:
Let me know if I'm doing something wrong with regexps and wildcards. I find false negatives particularly concerning.