
[BUG] regexp and wildcard query giving false negatives #12500

Closed
john-dicarlo opened this issue Feb 29, 2024 · 6 comments
Labels
bug (Something isn't working), Search (Search query, autocomplete, etc.), untriaged

Comments

@john-dicarlo

Describe the bug

I have some text I know is in one of my documents:

from opensearchpy import OpenSearch
import re

client = OpenSearch(
    hosts = [{'host': host, 'port': port}],
    http_compress = True, # enables gzip compression for request bodies
    http_auth = auth,
    use_ssl = True,
    verify_certs = False,
    ssl_assert_hostname = False,
    ssl_show_warn = False,
)

text = client.get('wiki', 2609)['_source']['body']
print(re.search('celebratory.{0,200}Neuroscience', text)[0])

celebratory responses, laughter, or [[self-serving bias]] in interpreting events.<ref name="SA_NTG">{{cite journal|title=The Neuroscience

Related component

Search

To Reproduce

But doing a regexp query does not give any results:

GET wiki/_search
{
    "query": {
        "regexp": {
            "body": {
                "value": "celebratory.{0,200}Neuroscience"
            }
        }
    }
}

{
  "took": 74,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

Nor a wildcard query:

GET wiki/_search
{
  "_source": false,
  "query": {
    "wildcard": {
      "body": {
        "value": "*celebratory*Neuroscience*"
      }
    }
  }
}

{
  "took": 854,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

Expected behavior

An intervals query works:

GET wiki/_search
{
  "_source": false,
  "query": {
    "intervals": {
      "body": {
        "all_of": {
          "intervals": [
            {
              "match": {
                "query": "celebratory"
              }
            },
            {
              "match": {
                "query": "Neuroscience"
              }
            }
          ]
        }
      }
    }
  }
}

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 172,
      "relation": "eq"
    },
    "max_score": 0.055860877,
    "hits": [
      {
        "_index": "wiki",
        "_id": "2609",
        "_score": 0.055860877
      },
      {
        "_index": "wiki",
        "_id": "8828",
        "_score": 0.006329179
      },
      {
        "_index": "wiki",
        "_id": "348",
        "_score": 0.005290985
      },
...

(The score is an order of magnitude higher for the correct document)

Additional Details

Here is what I have:

GET /

{
  "name": "yuzu",
  "cluster_name": "opensearch",
  "cluster_uuid": "5XDkbqB8SXCnjhm99Fj9TA",
  "version": {
    "distribution": "opensearch",
    "number": "2.11.1",
    "build_type": "tar",
    "build_hash": "6b1986e964d440be9137eba1413015c31c5a7752",
    "build_date": "2023-11-29T21:43:10.135035992Z",
    "build_snapshot": false,
    "lucene_version": "9.7.0",
    "minimum_wire_compatibility_version": "7.10.0",
    "minimum_index_compatibility_version": "7.0.0"
  },
  "tagline": "The OpenSearch Project: https://opensearch.org/"
}

Let me know if I'm doing something wrong with regexps and wildcards. I find false negatives particularly concerning.

@msfroh
Collaborator

msfroh commented Feb 29, 2024

So, this comes down to the term-based matching that Lucene (the underlying search library) does.

Your input text gets processed into a stream of tokens, typically one token per word. A token is a "term" (the string content), plus some other attributes, like the position in the stream.
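As a rough illustration of that token stream, here is a toy simulation of what a standard analyzer produces (split on non-word characters, lowercase, record positions). This is a simplified sketch, not Lucene's actual analysis chain:

```python
import re

def analyze(text):
    """Simplified stand-in for a standard analyzer:
    split on non-alphanumeric runs, lowercase, keep stream positions."""
    terms = [t.lower() for t in re.findall(r"\w+", text)]
    return list(enumerate(terms))  # (position, term) pairs

tokens = analyze("The Neuroscience of celebratory responses")
# Each word becomes its own term; term-level queries like regexp and
# wildcard are matched against these individual terms, not the raw string.
```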

In this case, you can match on celebratory AND Neuroscience if order doesn't matter, like:

{
  "query" : {
    "bool": {
      "must": [
        {
          "term": {
            "body":"celebratory"
          }
        },
        {
          "term": {
            "body":"Neuroscience",
            "case_insensitive": true // By default, the match is case-sensitive
          }
        }
      ]
    }
  }
}

This is equivalent to:

{
  "query" : {
    "query_string": {
      "fields": ["body"],
      "query": "celebratory AND Neuroscience" // No need to worry about case-sensitivity here, because these words get tokenized too
    }
  }
}

If order does matter and you want the words to occur within some distance (like in your original regexp query), you can use a span_near query, though it's not based on text offsets but on token positions:

{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "body": "celebratory" } },
        { "span_term": { "body": "neuroscience" } } // This is case sensitive, I believe
      ],
      "slop": 20, // This is the max distance between token positions of 
      "in_order": true
    }
  }
}

The regexp and wildcard queries are there to match regexps and wildcards against individual terms -- not the whole text. So, e.g., you could match on celeb*. The good news in your case is that the Boolean AND query or the span_near query will tend to be much faster than a regexp or wildcard query, because term-based matching is where Lucene shines.
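To see concretely why the original query returns nothing: the pattern does match the raw text, but a regexp query applies the pattern to each indexed term in its entirety, and no single term can span two words. A small sketch (the token list is an assumed analyzer output, for illustration only):

```python
import re

pattern = re.compile(r"celebratory.{0,200}Neuroscience")
text = "celebratory responses ... The Neuroscience"
tokens = ["celebratory", "responses", "the", "neuroscience"]  # assumed analyzer output

matches_text = bool(pattern.search(text))                      # the pattern spans the raw string
matches_any_term = any(pattern.fullmatch(t) for t in tokens)   # but no single term satisfies it
```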

@john-dicarlo
Author

john-dicarlo commented Feb 29, 2024

Sorry, I left out the reasoning behind my question. I have many regular expressions that I want to be able to use as-is and not have to rewrite them as a different type of query. This is only one example. Is this something Lucene can handle, or would I have to rewrite them?

EDIT: Didn't read your last paragraph. Okay, regexp and wildcard only work on terms, not whole text. Is that documented somewhere?

@msfroh
Collaborator

msfroh commented Feb 29, 2024

I edited my previous response to add that last paragraph just as you were posting, so I apparently anticipated your question to some extent 😁

I suppose you could explicitly mark your field as type keyword, so the full text would be indexed as a single term. That generally has limits on the max term size, though. If your use-case is small enough, it could work.
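A minimal sketch of such a mapping, in case it helps (the index name and the ignore_above cap are illustrative; values longer than the cap would simply not be indexed, and Lucene itself rejects single terms over 32766 bytes, which is why this only suits smaller documents):

```json
PUT wiki_keyword
{
  "mappings": {
    "properties": {
      "body": {
        "type": "keyword",
        "ignore_above": 8191
      }
    }
  }
}
```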

Ideally, we should add support for a Wildcard field type, like Elasticsearch did. We have an open issue for that at #5639

They have a really good blog post explaining the tradeoffs between the text and keyword field types (and why their wildcard field type is nice) here: https://www.elastic.co/blog/find-strings-within-strings-faster-with-the-new-elasticsearch-wildcard-field

@john-dicarlo
Author

Thanks for this overview of better ways to do this.

The concern I had is that if I had just written a regexp query, not knowing how to write the better ones, why do no results come back? Are regexp queries known to be buggy, or just slow? In my world, a query that silently returns false negatives is a big problem. Could something like a warning or error be raised when a regular expression can't work correctly in a regexp query?

The reason I ask all of this is that my company has a hardware product that uses regular expressions for queries, and we're trying to figure out what kind of integration with OpenSearch makes sense. But it's looking like people don't want to write regexps in OpenSearch anyway.

@msfroh
Collaborator

msfroh commented Feb 29, 2024

Regexp is a valid use-case, but right now OpenSearch is still built for the classic "tokenized text" behavior.

We really should address #5639 some time. The approach used by Elasticsearch (described in that blog post I linked above) sounds like it does a great job of prefiltering based on trigrams (using Lucene's great conjunctive matching) then doing the expensive wildcard/regexp evaluation on the filtered docs. We should implement something similar.
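The trigram-prefilter idea can be sketched in a few lines (a toy illustration of the technique, not Elasticsearch's or a proposed OpenSearch implementation): index each document's trigrams, use a cheap conjunctive check on the trigrams of the query's literal to narrow the candidate set, and run the expensive regex only on the survivors:

```python
import re

def trigrams(s):
    """All 3-character substrings of s, lowercased."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

docs = {
    1: "celebratory responses and the Neuroscience of gratitude",
    2: "a completely unrelated document",
}
index = {doc_id: trigrams(text) for doc_id, text in docs.items()}

needle = "celebratory"
required = trigrams(needle)
# Cheap conjunctive prefilter: every trigram of the literal must be present.
candidates = [d for d, grams in index.items() if required <= grams]
# Expensive verification only on the surviving candidates.
hits = [d for d in candidates if re.search(needle, docs[d], re.IGNORECASE)]
```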

I would suggest speaking up on #5639 to highlight your use-case as another +1 for that issue.

@john-dicarlo
Author

Thanks for your input. This has been helpful.
