Should we consider adding support for taxonomy indices? #11355

Open
msfroh opened this issue Nov 27, 2023 · 1 comment
Labels
discuss: Issues intended to help drive brainstorming and decision making
enhancement: Enhancement or improvement to existing feature or request
Search:Aggregations

Comments

@msfroh
Collaborator

msfroh commented Nov 27, 2023

Is your feature request related to a problem? Please describe.
In one of our OpenSearch Lucene Study Group Meetings, we talked about improvements made to Lucene's faceting with taxonomy indices. A question that came up was whether it makes sense to add support for taxonomy indices to OpenSearch. I promised to create an issue to discuss it, so here we are.

Background

More broadly, the question is less about adding support for taxonomy indices specifically and more about whether we should try leveraging Lucene's facets module: https://lucene.apache.org/core/9_8_0/demo/org/apache/lucene/demo/facet/package-summary.html.

Lucene's facets fill a niche similar to aggregations in OpenSearch. For historical reasons unknown to me, aggregations were implemented separately, with no connection to Lucene's facets module. Currently, OpenSearch depends on most Lucene modules, but not on facets.

Pros

  • If we can push logic down into Lucene, it's logic that OpenSearch doesn't have to worry about as much. There are some very smart people working on Lucene's facets implementation so leveraging their work would be nice. If we contribute, we get perspectives from outside of OpenSearch.
  • If we use taxonomy indices, I believe it effectively pushes the computation of global ordinals to indexing time, which takes some work off of search time. My understanding is that the OrdinalMap instances, which map per-segment ordinals to a global ordinal that is valid for the whole shard, occupy heap for every possible value.
  • The facets module can also work without taxonomy indices by using existing doc value fields, so we could leverage its functionality without adopting taxonomy indices (though we would then pay the OrdinalMap price instead).
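
To make the global-ordinal trade-off above concrete, here is a minimal toy sketch in plain Java. It is not Lucene or OpenSearch code: `Segment`, `Taxonomy`, and both counting methods are hypothetical names made up for illustration. The first path builds a per-segment lookup table at search time (roughly the role I understand OrdinalMap to play); the second assigns shard-wide ordinals at index time, the way a taxonomy index would, so counting needs no mapping step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: none of these classes are Lucene or OpenSearch APIs.
public class GlobalOrdinalSketch {

    /** A segment with its own sorted term dictionary; docOrds holds one segment-local ordinal per document. */
    static final class Segment {
        final String[] sortedTerms;
        final int[] docOrds;

        Segment(String[] sortedTerms, int[] docOrds) {
            this.sortedTerms = sortedTerms;
            this.docOrds = docOrds;
        }
    }

    /** Search-time path: merge the segment dictionaries, then build one lookup table per segment. */
    static Map<String, Integer> countViaOrdinalMap(List<Segment> segments) {
        TreeSet<String> merged = new TreeSet<>();
        for (Segment s : segments) {
            merged.addAll(Arrays.asList(s.sortedTerms));
        }
        List<String> globalTerms = new ArrayList<>(merged);

        // segToGlobal[i][segOrd] = global ordinal; in this toy, one heap entry per unique value per segment.
        int[][] segToGlobal = new int[segments.size()][];
        for (int i = 0; i < segments.size(); i++) {
            String[] terms = segments.get(i).sortedTerms;
            segToGlobal[i] = new int[terms.length];
            for (int ord = 0; ord < terms.length; ord++) {
                segToGlobal[i][ord] = Collections.binarySearch(globalTerms, terms[ord]);
            }
        }

        int[] counts = new int[globalTerms.size()];
        for (int i = 0; i < segments.size(); i++) {
            for (int segOrd : segments.get(i).docOrds) {
                counts[segToGlobal[i][segOrd]]++;
            }
        }
        return toLabeledCounts(globalTerms, counts);
    }

    /** Index-time path: a shared taxonomy hands out shard-wide ordinals as labels are first seen. */
    static final class Taxonomy {
        final Map<String, Integer> ordByLabel = new HashMap<>();
        final List<String> labels = new ArrayList<>();

        int addOrGet(String label) {
            Integer ord = ordByLabel.get(label);
            if (ord == null) {
                ord = labels.size();
                ordByLabel.put(label, ord);
                labels.add(label);
            }
            return ord;
        }
    }

    /** Documents already store shard-wide ordinals, so counting needs no per-segment mapping. */
    static Map<String, Integer> countViaTaxonomy(Taxonomy taxo, List<int[]> docOrdsPerSegment) {
        int[] counts = new int[taxo.labels.size()];
        for (int[] docOrds : docOrdsPerSegment) {
            for (int ord : docOrds) {
                counts[ord]++;
            }
        }
        return toLabeledCounts(taxo.labels, counts);
    }

    private static Map<String, Integer> toLabeledCounts(List<String> labels, int[] counts) {
        Map<String, Integer> out = new TreeMap<>();
        for (int i = 0; i < counts.length; i++) {
            out.put(labels.get(i), counts[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Segment A holds docs red, blue, red. Segment B holds docs green, red.
        List<Segment> segments = Arrays.asList(
            new Segment(new String[] {"blue", "red"}, new int[] {1, 0, 1}),
            new Segment(new String[] {"green", "red"}, new int[] {0, 1}));
        System.out.println(countViaOrdinalMap(segments));

        Taxonomy taxo = new Taxonomy();
        int[] segA = {taxo.addOrGet("red"), taxo.addOrGet("blue"), taxo.addOrGet("red")};
        int[] segB = {taxo.addOrGet("green"), taxo.addOrGet("red")};
        System.out.println(countViaTaxonomy(taxo, Arrays.asList(segA, segB)));
    }
}
```

Both paths produce the same per-value counts. The difference is where the work and memory go: the first rebuilds mapping tables from the segment dictionaries, while the second pays at indexing time and needs a second structure (the taxonomy) to be kept alongside the documents.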

Cons

  • While it may be a bit of a sunk cost fallacy, we have a lot invested in the existing aggregations implementation. Moving it all to Lucene's facets could be quite a large effort. While we wouldn't need to do it all at once, there is some cognitive burden associated with having some aggregations implemented with Lucene facets while others are not.
  • A lot of the benefit (I believe) would come with enabling index-time taxonomy fields and the associated taxonomy index. That adds a fair bit of complexity that we don't currently have: we would need to manage two writers and two readers per shard. There is a SearcherTaxonomyManager that simplifies some of this, but it wouldn't be a small change. I'm a little scared to think about what managing two Lucene indices per shard would mean for segment replication.
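
As a rough illustration of the "two writers and readers" concern, here is a toy model in plain Java of handing out both views as one consistent pair, which is roughly what I understand SearcherTaxonomyManager to do for a searcher and a taxonomy reader. Every class and method name below is hypothetical; nothing here is a Lucene or OpenSearch API, and the ordering rule in the comments is an assumption of the model.

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model only: these classes are hypothetical and are not Lucene or OpenSearch APIs.
public class PairedRefreshSketch {

    /** Stand-in for the two writers of one shard. */
    static final class ShardWriters {
        private int taxonomySize = 0;   // ordinals assigned so far in the taxonomy index
        private int maxOrdInIndex = -1; // highest ordinal referenced by a document in the main index

        synchronized void addDocWithNewCategory() {
            int ord = taxonomySize++;   // the taxonomy learns the category first
            maxOrdInIndex = ord;        // then the document that references it is indexed
        }

        synchronized int taxonomySize() { return taxonomySize; }

        synchronized int maxOrdInIndex() { return maxOrdInIndex; }
    }

    /** Immutable point-in-time view of both indices. */
    static final class Snapshot {
        final int maxOrdInIndex;
        final int taxonomySize;

        Snapshot(int maxOrdInIndex, int taxonomySize) {
            this.maxOrdInIndex = maxOrdInIndex;
            this.taxonomySize = taxonomySize;
        }

        /** Every ordinal the main index refers to must exist in the taxonomy view. */
        boolean isConsistent() { return maxOrdInIndex < taxonomySize; }
    }

    /** Publishes both views together so a search never mixes an old taxonomy with a newer index. */
    static final class PairManager {
        private final ShardWriters writers;
        private final AtomicReference<Snapshot> current;

        PairManager(ShardWriters writers) {
            this.writers = writers;
            this.current = new AtomicReference<>(read());
        }

        private Snapshot read() {
            // Read the main index first: the taxonomy read afterwards is the same age or newer,
            // so every ordinal the index view references is known to the taxonomy view.
            int maxOrd = writers.maxOrdInIndex();
            int taxoSize = writers.taxonomySize();
            return new Snapshot(maxOrd, taxoSize);
        }

        void refresh() { current.set(read()); }

        Snapshot acquire() { return current.get(); }
    }

    public static void main(String[] args) {
        ShardWriters writers = new ShardWriters();
        PairManager manager = new PairManager(writers);

        Snapshot before = manager.acquire();
        writers.addDocWithNewCategory();
        writers.addDocWithNewCategory();
        manager.refresh();
        Snapshot after = manager.acquire();

        System.out.println(before.isConsistent() + " " + after.isConsistent());
    }
}
```

Even in this stripped-down form, anything that copies shard state (such as segment replication) would have to move the pair together, which is the part that worries me.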

What's next?

The above is just my take on the arguments for and against using the Lucene facets module. I would love to move the heavy lifting of aggregations out of OpenSearch, but I also see it as a huge effort with potentially little payback.

What do y'all think?

@msfroh added the enhancement, discuss, untriaged, and Search:Aggregations labels Nov 27, 2023
@msfroh removed the untriaged label Jan 31, 2024
@reta
Collaborator

reta commented Aug 14, 2024

Recently merged improvements: apache/lucene#13568

@getsaurabh02 moved this from 🆕 New to Later (6 months plus) in Search Project Board Aug 15, 2024
Projects
Status: Later (6 months plus)