[RFC] Parquet/Avro Storage Extension With External Writer #13668
Comments
Can you be more specific about what you mean here? I know there are the
Plugins are loaded in a separate classloader partly to avoid these problems, as I understand it. For example, the At first glance, it seems to me that the existing plugin framework should be able to work for this case (meaning the overall framework, not necessarily that the existing interfaces provide the exact extension points you need). Do you have a strong requirement to run this as a non-JVM runtime? |
I think a good way to illustrate the issue is to look at the KNN plugin. It tries to extend the knnVectorsFormat in the Lucene interface with a native codec implementation. For that, it has to override the Lucene consumer/producer interface of the knnVectorsFormat. For reference: here
The JarHell tool will scan the plugin dependencies and look for collisions. Sometimes the dependencies themselves have internal collisions, and there is not much you can do about that. At least that's what I noticed when I was trying to include
True, I commented in the proposal as well that there is no technical limitation at the moment that mandates an external runtime for this plugin. Given sufficient time and effort it is possible to fix most issues and get it to work. |
I think JarHell is telling you that there are two implementations of the same fully qualified class name on the classpath. You would want to fix this regardless of how you were running that JVM since classloading is non-deterministic in this case.
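To make the collision concrete, here is a minimal sketch of the kind of check being described: scan a set of jars and report any class file that appears in more than one of them. This is not OpenSearch's actual JarHell code; the function name `find_duplicate_classes` is hypothetical, and it is written in Python only to keep it short.

```python
import zipfile
from collections import defaultdict
from pathlib import Path


def find_duplicate_classes(jar_paths):
    """Return {class entry name: [jar names]} for classes found in 2+ jars."""
    owners = defaultdict(list)
    for jar in jar_paths:
        with zipfile.ZipFile(jar) as archive:
            # A set avoids counting a class twice within the same jar.
            for entry in set(archive.namelist()):
                if entry.endswith(".class") and not entry.endswith("module-info.class"):
                    owners[entry].append(Path(jar).name)
    return {name: jars for name, jars in owners.items() if len(jars) > 1}
```

Any entry in the returned mapping is a fully qualified class name with more than one candidate implementation, which is the situation where the class that actually gets loaded depends on classpath ordering.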
I would need more details on the specific implementation of this approach to really comment on it (please do share some version of your POC, even if only the OpenSearch changes, if you can). But I would expect that you'd have to introduce some kind of extension point into the core here, and then provide a no-op implementation for it by default, and optionally the IPC version of it that talks to an external process. It seems this might look a lot like the existing plugin framework? And once you have the plugin extension point you could provide an IPC-based implementation or an in-JVM implementation of it. |
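A rough sketch of the shape described in this comment: one extension point, a no-op default, and an optional implementation that forwards to an external process. Everything here is hypothetical (`StoredFieldsSink`, `NoOpSink`, `IpcSink`, `index_documents` are illustrative names, not OpenSearch APIs), and it is written in Python for brevity rather than as real plugin code.

```python
import json
from typing import Callable, Protocol


class StoredFieldsSink(Protocol):
    """Hypothetical extension point: receives each document's stored fields."""

    def write_document(self, doc_id: int, fields: dict) -> None: ...


class NoOpSink:
    """Default implementation: the core behaves as if no extension were installed."""

    def write_document(self, doc_id: int, fields: dict) -> None:
        pass


class IpcSink:
    """Serializes each document and hands it to a caller-supplied send function."""

    def __init__(self, send: Callable[[bytes], None]) -> None:
        self._send = send

    def write_document(self, doc_id: int, fields: dict) -> None:
        payload = json.dumps({"doc_id": doc_id, "fields": fields}).encode("utf-8")
        self._send(payload)


def index_documents(docs, sink=None):
    """Stand-in for the core indexing path; it only knows about the interface."""
    sink = sink if sink is not None else NoOpSink()
    for doc_id, fields in enumerate(docs):
        sink.write_document(doc_id, fields)
    return len(docs)
```

With this split, the core never needs to know whether the sink runs in-JVM or out of process; `IpcSink(some_socket.sendall)` and an in-process writer would both satisfy the same interface.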
While I agree this is the right practice, unfortunately if two implementations of the same class exist in an underlying dependency, it is not always easy to fix without modifying that dependency's source code.
I understand it could be hard to visualize. I will try to provide a draft of the POC pretty soon to show the example. |
Can you also expand a bit on how the overall feature would be used? Namely, how does this approach compare to writing Parquet files in tandem with indexing into OpenSearch at ingestion time by using a tool like Data Prepper? |
Thanks @samuel-oci. I definitely like the idea, as it is almost in line with #12948. However, like @andrross, I am curious what the extension point looks like, given the issues you describe running into while setting up the writer. Also, if we can make one open format work for search queries requiring source fields, I see that as a win, since we might end up saving some redundant storage cost. |
@andrross @Bukhtawar Perhaps I didn't make this clear earlier, but there is one point about the motivation that I forgot to put in the description: if in the future we have Lucene extensions in Python, Rust, etc., I believe integration at the storage-encoding level could use a similar solution. I have now edited the description and added this to the advantages section. |
Is your feature request related to a problem? Please describe
I am interested in extending the DocValues and StoredFields codecs to use a format such as Parquet or Avro.
The main reasoning behind it is that those are highly popular formats that can be easily read by other popular projects such as Apache Spark.
There are currently a number of ways to do so:
JarHell and other runtime issues that are making this process very difficult and could interfere with other plugins as well.
Describe the solution you'd like
I have created a POC that seems to work well in extending to both Avro and Parquet by using an external writer that is spawned by the main OpenSearch engine. The engine communicates with the external writer via IPC over system sockets. This presents a number of advantages:
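As an illustration only (this is not the POC code), the engine-to-writer exchange could look roughly like the sketch below. It assumes a length-prefixed framing with an empty frame marking the end of a batch, which is a made-up convention; the writer runs in a thread over `socket.socketpair()` instead of a spawned process so the example stays self-contained, and it just collects rows where a real writer would encode Parquet or Avro.

```python
import json
import socket
import struct
import threading


def send_frame(sock, payload: bytes) -> None:
    """Send one message as a 4-byte big-endian length followed by the payload."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)


def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes, or raise if the peer closes early."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-frame")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_frame(sock) -> bytes:
    """Read one length-prefixed frame."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def writer_loop(sock, rows: list) -> None:
    """Stand-in for the external writer: collect rows, then acknowledge."""
    while True:
        frame = recv_frame(sock)
        if not frame:  # empty frame = end of batch (hypothetical convention)
            break
        rows.append(json.loads(frame))
    send_frame(sock, json.dumps({"rows_written": len(rows)}).encode("utf-8"))


def demo():
    """Engine side: stream two documents, signal end of batch, wait for the ack."""
    engine_sock, writer_sock = socket.socketpair()
    rows = []
    writer = threading.Thread(target=writer_loop, args=(writer_sock, rows))
    writer.start()
    for doc in ({"id": 1, "title": "a"}, {"id": 2, "title": "b"}):
        send_frame(engine_sock, json.dumps(doc).encode("utf-8"))
    send_frame(engine_sock, b"")
    ack = json.loads(recv_frame(engine_sock))
    writer.join()
    engine_sock.close()
    writer_sock.close()
    return ack, rows
```

Because only bytes cross the socket, the writer's dependencies never share a classpath with the engine, and the writer itself does not have to be a JVM process.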
Related component
Storage
Describe alternatives you've considered
I have considered extending the core engine interface itself to be format agnostic (not dependent on Lucene APIs). However, the engine interfaces are tightly bound to the Lucene spec. For example, it relies on segmentInfos, etc. Since those APIs are generic enough, I didn't see a need to replicate them in a non-Lucene API spec.
Additional context
No response