[Feature Request] Support for open/neutral data formats for engine agnostic reads #12948
Comments
@Bukhtawar - Thanks for the proposal. Format could mean two things here: i) the format of the data represented as part of the document, ii) the format of the data at rest (compressed and stored). Currently the index codec defines both. Are you suggesting changing both or just the first one?
Thanks @backslasht. Here I intend to keep the data stored at rest in a format that makes it easier for diverse query engines to be plugged in and helps data break free from the Lucene version compatibility constraints as much as possible. Looping in @reta @andrross @msfroh @tharejas @sachinpkale @gbbafna for thoughts.
Nice proposal! I am trying to understand scope of this feature request with following questions: For my understanding, is the
Does querying the original data from another query engine bypass OpenSearch, or does this also mean that OpenSearch would support pluggable query engines?
As far as I remember, the |
+1. I like the overall idea of decoupling the source from the engine. A couple of questions/thoughts:
@reta @Bukhtawar There does seem to be some overlap here with an ingestion tool like Data Prepper, where you can configure another sink alongside OpenSearch and store the data in a neutral, analytics-friendly format. The two use cases listed in this issue ("use any query engine" and "reindex seamlessly") could be solved by ingesting the original data into an additional sink. However, in that case OpenSearch has no knowledge of the other data and cannot use it the way that it uses the
Thanks @Bukhtawar for the proposal. I definitely see the value of storing the _source field in a data format (considering it is just a document blob) that is not bound to the Lucene engine version, especially for re-indexing.
That's true, I don't think you can rely on the _source field, since it can be disabled.
Obviously we are talking about a new data format, which would apply from a newer version onwards. Based on how the proposal goes, we can always decide to change that if we see good benefits, especially as OpenSearch has good support for durability but is constrained on data compatibility.
A few thoughts:
Question: |
Hi @Bukhtawar that's a very interesting suggestion! Some clarification questions to make sure I get it right:
Context: I currently have a working POC in which I extended the |
Is your feature request related to a problem? Please describe
While writing data in Lucene format enables faster queries, it also limits queries to a compatible Lucene query engine. As data grows over time, the need to keep the engine compatible with the older data format imposes another constraint, forcing users to choose between getting the benefits of newer versions and keeping data in older formats readable.
To upgrade the engine, the data indexed in older formats then needs to be re-indexed, which requires the data to be read from the source field with a compatible Lucene engine before individual documents can be re-indexed into a target version.
Describe the solution you'd like
The _source field stores the raw doc as a special field; however, this field can only be read by a compatible Lucene version. It would be good if we could store this field in an open/neutral format. This would enable users to use any query engine and reindex seamlessly.
There could be caveats, though, with query performance where the actual doc needs to be returned, depending on the data format; this needs to be evaluated further.
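As a rough illustration of the idea only (the issue does not name a concrete format), the sketch below assumes the raw documents are kept as gzip-compressed, newline-delimited JSON, so that any reader can iterate over them without a Lucene codec. The file layout and the helper names `write_source_segment` and `read_source_segment` are hypothetical and are not OpenSearch APIs.

```python
import gzip
import json
import os
import tempfile
from typing import Dict, Iterable, Iterator, List


def write_source_segment(path: str, docs: Iterable[Dict]) -> int:
    """Write raw documents as gzip-compressed newline-delimited JSON.

    Hypothetical helper: one JSON object per line, no Lucene involved.
    Returns the number of documents written.
    """
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for doc in docs:
            # json.dumps escapes embedded newlines, so each doc stays on one line.
            fh.write(json.dumps(doc, separators=(",", ":")))
            fh.write("\n")
            count += 1
    return count


def read_source_segment(path: str) -> Iterator[Dict]:
    """Stream documents back using only the standard library."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def demo() -> List[Dict]:
    """Round-trip two documents through a temporary segment file."""
    docs = [
        {"id": 1, "title": "hello"},
        {"id": 2, "title": "multi\nline"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "segment-0.ndjson.gz")
        write_source_segment(path, docs)
        # A reindex job or an external engine could consume this iterator.
        return list(read_source_segment(path))
```

A columnar format could serve the same purpose; the point of the sketch is only that reading the raw documents back needs no version-matched Lucene engine. It also shows where the performance caveat would bite: fetching a single doc means scanning or separately indexing the file.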
Related component
Storage
Describe alternatives you've considered
No response
Additional context
No response