
[BUG] regexp and wildcard query giving false negatives #12500

Closed
john-dicarlo opened this issue Feb 29, 2024 · 6 comments
Labels
bug (Something isn't working), Search (Search query, autocomplete, etc.), untriaged

Comments

@john-dicarlo

Describe the bug

I have some text I know is in one of my documents:

from opensearchpy import OpenSearch
import re

client = OpenSearch(
    hosts = [{'host': host, 'port': port}],
    http_compress = True, # enables gzip compression for request bodies
    http_auth = auth,
    use_ssl = True,
    verify_certs = False,
    ssl_assert_hostname = False,
    ssl_show_warn = False,
)

text = client.get('wiki', 2609)['_source']['body']
print(re.search('celebratory.{0,200}Neuroscience', text)[0])

celebratory responses, laughter, or [[self-serving bias]] in interpreting events.<ref name="SA_NTG">{{cite journal|title=The Neuroscience

Related component

Search

To Reproduce

But doing a regexp query does not give any results:

GET wiki/_search
{
    "query": {
        "regexp": {
            "body": {
                "value": "celebratory.{0,200}Neuroscience"
            }
        }
    }
}

{
  "took": 74,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

Nor a wildcard query:

GET wiki/_search
{
  "_source": false,
  "query": {
    "wildcard": {
      "body": {
        "value": "*celebratory*Neuroscience*"
      }
    }
  }
}

{
  "took": 854,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

Expected behavior

An intervals query works:

GET wiki/_search
{
  "_source": false,
  "query": {
    "intervals": {
      "body": {
        "all_of": {
          "intervals": [
            {
              "match": {
                "query": "celebratory"
              }
            },
            {
              "match": {
                "query": "Neuroscience"
              }
            }
          ]
        }
      }
    }
  }
}

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 172,
      "relation": "eq"
    },
    "max_score": 0.055860877,
    "hits": [
      {
        "_index": "wiki",
        "_id": "2609",
        "_score": 0.055860877
      },
      {
        "_index": "wiki",
        "_id": "8828",
        "_score": 0.006329179
      },
      {
        "_index": "wiki",
        "_id": "348",
        "_score": 0.005290985
      },
...

(The score is an order of magnitude higher for the correct document)

Additional Details

Here is what I have:

GET /

{
  "name": "yuzu",
  "cluster_name": "opensearch",
  "cluster_uuid": "5XDkbqB8SXCnjhm99Fj9TA",
  "version": {
    "distribution": "opensearch",
    "number": "2.11.1",
    "build_type": "tar",
    "build_hash": "6b1986e964d440be9137eba1413015c31c5a7752",
    "build_date": "2023-11-29T21:43:10.135035992Z",
    "build_snapshot": false,
    "lucene_version": "9.7.0",
    "minimum_wire_compatibility_version": "7.10.0",
    "minimum_index_compatibility_version": "7.0.0"
  },
  "tagline": "The OpenSearch Project: https://opensearch.org/"
}

Let me know if I'm doing something wrong with regexps and wildcards. I find false negatives particularly concerning.

@msfroh
Collaborator

msfroh commented Feb 29, 2024

So, this comes down to the term-based matching that Lucene (the underlying search library) does.

Your input text gets processed into a stream of tokens, typically one token per word. A token is a "term" (the string content), plus some other attributes, like the position in the stream.
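As a rough illustration of that token stream, here is a toy simulation of what a standard analyzer produces (split on non-word characters, lowercase, record positions). This is a simplified sketch, not Lucene's actual analysis chain:

```python
import re

def analyze(text):
    """Simplified stand-in for a standard analyzer:
    split on non-alphanumeric runs, lowercase, keep stream positions."""
    terms = [t.lower() for t in re.findall(r"\w+", text)]
    return list(enumerate(terms))  # (position, term) pairs

tokens = analyze("The Neuroscience of celebratory responses")
# Each word becomes its own term; term-level queries like regexp and
# wildcard are matched against these individual terms, not the raw string.
```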

In this case, you can match on celebratory AND Neuroscience if order doesn't matter, like:

{
  "query" : {
    "bool": {
      "must": [
        {
          "term": {
            "body":"celebratory"
          }
        },
        {
          "term": {
            "body":"Neuroscience",
            "case_insensitive": true // By default, the match is case-sensitive
          }
        }
      ]
    }
  }
}

This is equivalent to:

{
  "query" : {
    "query_string": {
      "fields": ["body"],
      "query": "celebratory AND Neuroscience" // No need to worry about case-sensitivity here, because these words get tokenized too
    }
  }
}

If order does matter and you want the words to occur within some distance (like in your original regexp query), you can use a span_near query, though it's not based on text offsets but on token positions:

{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "body": "celebratory" } },
        { "span_term": { "body": "neuroscience" } } // This is case sensitive, I believe
      ],
      "slop": 20, // This is the max distance between token positions of 
      "in_order": true
    }
  }
}

The regexp and wildcard queries are there to match regexps and wildcards against individual terms -- not the whole text. So, e.g., you could match on celeb*. The good news in your case is that the Boolean AND query or the span_near query will tend to be much faster than a regexp or wildcard query, because term-based matching is where Lucene shines.
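To see concretely why the original query returns nothing: the pattern does match the raw text, but a regexp query applies the pattern to each indexed term in its entirety, and no single term can span two words. A small sketch (the token list is an assumed analyzer output, for illustration only):

```python
import re

pattern = re.compile(r"celebratory.{0,200}Neuroscience")
text = "celebratory responses ... The Neuroscience"
tokens = ["celebratory", "responses", "the", "neuroscience"]  # assumed analyzer output

matches_text = bool(pattern.search(text))                      # the pattern spans the raw string
matches_any_term = any(pattern.fullmatch(t) for t in tokens)   # but no single term satisfies it
```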

@john-dicarlo
Author

john-dicarlo commented Feb 29, 2024

Sorry, I left out the reasoning behind my question. I have many regular expressions that I want to be able to use as-is and not have to rewrite them as a different type of query. This is only one example. Is this something Lucene can handle, or would I have to rewrite them?

EDIT: Didn't read your last paragraph. Okay, regexp and wildcard only work on terms, not whole text. Is that documented somewhere?

@msfroh
Collaborator

msfroh commented Feb 29, 2024

I edited my previous response to add that last paragraph just as you were posting, so I apparently anticipated your question to some extent 😁

I suppose you could explicitly mark your field as type keyword, so the full text would be indexed as a single term. That generally has limits on the max term size, though. If your use-case is small enough, it could work.
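A minimal sketch of such a mapping, in case it helps (the index name and the ignore_above cap are illustrative; values longer than the cap would simply not be indexed, and Lucene itself rejects single terms over 32766 bytes, which is why this only suits smaller documents):

```json
PUT wiki_keyword
{
  "mappings": {
    "properties": {
      "body": {
        "type": "keyword",
        "ignore_above": 8191
      }
    }
  }
}
```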

Ideally, we should add support for a Wildcard field type, like Elasticsearch did. We have an open issue for that at #5639

They have a really good blog post explaining the tradeoffs between the text and keyword field types (and why their wildcard field type is nice) here: https://www.elastic.co/blog/find-strings-within-strings-faster-with-the-new-elasticsearch-wildcard-field

@john-dicarlo
Author

Thanks for this overview of better ways to do this.

The concern I had is that if I had just written a regexp query, not knowing how to write the better ones, why do no results come back? Are regexp queries known to be buggy, or just slow? In my world, a query that silently returns false negatives is a big problem. Could something like a warning or error be raised when a regular expression can't work correctly in a regexp query?

The reason I ask all of this is that my company has a hardware product that uses regular expressions for queries, and we're trying to figure out what kind of integration with OpenSearch makes sense. But it's looking like people don't want to write regexps in OpenSearch anyway.

@msfroh
Collaborator

msfroh commented Feb 29, 2024

Regexp is a valid use-case, but right now OpenSearch is still built for the classic "tokenized text" behavior.

We really should address #5639 some time. The approach used by Elasticsearch (described in that blog post I linked above) sounds like it does a great job of prefiltering based on trigrams (using Lucene's great conjunctive matching) then doing the expensive wildcard/regexp evaluation on the filtered docs. We should implement something similar.
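The trigram-prefilter idea can be sketched in a few lines (a toy illustration of the technique, not Elasticsearch's or a proposed OpenSearch implementation): index each document's trigrams, use a cheap conjunctive check on the trigrams of the query's literal to narrow the candidate set, and run the expensive regex only on the survivors:

```python
import re

def trigrams(s):
    """All 3-character substrings of s, lowercased."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

docs = {
    1: "celebratory responses and the Neuroscience of gratitude",
    2: "a completely unrelated document",
}
index = {doc_id: trigrams(text) for doc_id, text in docs.items()}

needle = "celebratory"
required = trigrams(needle)
# Cheap conjunctive prefilter: every trigram of the literal must be present.
candidates = [d for d, grams in index.items() if required <= grams]
# Expensive verification only on the surviving candidates.
hits = [d for d in candidates if re.search(needle, docs[d], re.IGNORECASE)]
```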

I would suggest speaking up on #5639 to highlight your use-case as another +1 for that issue.

@john-dicarlo
Author

Thanks for your input. This has been helpful.
