boolean field type should have a parameter hinting at the more common value #11143

Open
msfroh opened this issue Nov 8, 2023 · 0 comments
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance Search:Query Capabilities

msfroh commented Nov 8, 2023

Is your feature request related to a problem? Please describe.
The boolean field type is essentially just a specialized keyword field type, where the only possible values are true and false. In practice, though, there are many cases where more than 90% of documents share the same value: is_deleted:false, is_visible:true, etc.

Because Lucene skips through sparse values much more cheaply (even in a negation), we should (when possible) only index the less common term. A query matching the more common term would be rewritten as a NOT of the less common term. (That is, assuming your documents are almost all is_visible:true, then a query for is_visible:true becomes NOT is_visible:false, since the latter will only need to skip through a small number of matching docs to exclude them.)

Describe the solution you'd like
The boolean field type should accept a parameter that provides a hint saying "This field is usually (true/false)". I don't have a good name for the parameter -- maybe "usually", like:

  "mappings": {
    "properties": {
      "is_visible": {
        "type": "boolean",
        "usually": true
      }
    }
  }

Then we would only index (or write doc values for) false values. As mentioned above, a query for is_visible:true gets rewritten to NOT is_visible:false.
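
To make the idea concrete, here is a minimal toy sketch in Python. It is not Lucene or OpenSearch code: the class `SparseBooleanField` and its methods are hypothetical names invented for illustration. It assumes doc IDs are dense (0 to max_doc - 1) and that every document has a value for the field. Only the value that differs from the `usually` hint is stored, and a query for the common value is answered as the complement of the stored doc IDs.

```python
from typing import List, Set


class SparseBooleanField:
    """Toy model (hypothetical): stores doc IDs only for the less common value."""

    def __init__(self, usually: bool) -> None:
        self.usually = usually          # the hinted common value
        self.rare_docs: Set[int] = set()  # doc IDs holding the other value
        self.max_doc = 0                # number of docs seen (dense IDs assumed)

    def index(self, doc_id: int, value: bool) -> None:
        self.max_doc = max(self.max_doc, doc_id + 1)
        # Only the uncommon value is written; the common one is implied.
        if value != self.usually:
            self.rare_docs.add(doc_id)

    def search(self, value: bool) -> List[int]:
        if value != self.usually:
            # Query for the rare value: read the stored doc IDs directly.
            return sorted(self.rare_docs)
        # Query for the common value: rewritten as NOT <rare value>.
        return [d for d in range(self.max_doc) if d not in self.rare_docs]


field = SparseBooleanField(usually=True)
for doc_id in range(10):
    field.index(doc_id, value=doc_id not in (3, 7))

print(field.search(False))  # the two stored doc IDs
print(field.search(True))   # everything except the stored doc IDs
```

In this sketch only two doc IDs are stored for ten documents, and both queries still return the right documents. A document with no value for the field would wrongly match the negated query, so a real implementation would need to handle missing values separately.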

Describe alternatives you've considered
I'm going to send an idea to the lucene-dev mailing list: count the values when writing a segment, then write only the less common value and a DocsEnum for its docs. When segments are merged, you again write out only the doc IDs for whichever value is less common in the resulting segment.

That way, you don't need to provide a hint upfront, and the rewrite could be done per segment. (If the split is closer to 50/50, you could also do a clever merge that separates the true and false values into different segments, so you end up with segments that are entirely true or entirely false, and queries against them become match-all or match-none.)
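
A rough sketch of the per-segment alternative, again as toy Python rather than actual Lucene code. The `Segment` dataclass and the `write_segment` and `merge_segments` functions are hypothetical names. The sketch assumes every document has a value and ignores deletions. Each segment records which value is less common and stores only that value's doc IDs; a merge recounts across the inputs and writes only the less common value for the merged segment.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    max_doc: int          # number of docs in the segment
    rare_value: bool      # the less common value in this segment
    rare_docs: List[int]  # sorted doc IDs holding rare_value

    def count(self, value: bool) -> int:
        n = len(self.rare_docs)
        return n if value == self.rare_value else self.max_doc - n

    def docs(self, value: bool) -> List[int]:
        if value == self.rare_value:
            return list(self.rare_docs)
        # The common value is the complement of the stored doc IDs.
        excluded = set(self.rare_docs)
        return [d for d in range(self.max_doc) if d not in excluded]


def write_segment(values: Sequence[bool]) -> Segment:
    """Count values at write time and keep only the less common one."""
    true_count = sum(1 for v in values if v)
    rare_value = true_count <= len(values) - true_count  # ties store True
    rare_docs = [i for i, v in enumerate(values) if v == rare_value]
    return Segment(len(values), rare_value, rare_docs)


def merge_segments(segments: Sequence[Segment]) -> Segment:
    """Recount across inputs and write only the merged segment's rare value."""
    max_doc = sum(s.max_doc for s in segments)
    true_count = sum(s.count(True) for s in segments)
    rare_value = true_count <= max_doc - true_count
    merged: List[int] = []
    base = 0  # doc ID offset of the current input segment
    for s in segments:
        merged.extend(base + d for d in s.docs(rare_value))
        base += s.max_doc
    return Segment(max_doc, rare_value, merged)


a = write_segment([True, True, True, False])  # stores False at doc 3
b = write_segment([False, False, True])       # stores True at doc 2
m = merge_segments([a, b])
print(m.rare_value, m.rare_docs)
```

Note that the two input segments store different values, so the merge has to expand the complement for any segment whose stored value is not the merged segment's less common value.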

Additional context
N/A

Projects
Status: Todo
Status: Later (6 months plus)