[RFC] Data Access Object Interface for Metadata Store #13336
Comments
@fddattal is also working along similar lines [1] to abstract out plugins' interfacing with core for state.
While plugins may not need full-text search capability, some search features are nice to have and could simplify migration of existing plugins. Could we define a minimal search capability that could be easily implemented across different backend stores?
Thanks for creating this RFC; a few follow-up questions:
We would probably want to include a listing feature (a "search all" equivalent). If we had keyword-based fields, we could probably include a limited number of them as filters on the list.

This somewhat relates to @xinlamzn's comment above. We need some sort of basic search; the problem is setting performance expectations. NoSQL DBs are optimized for key-value lookup, not for searching across all the data.

Possibly! I'd think things that are included already in OpenSearch would be prime candidates. A Java Client implementation for a remote cluster would probably have wide applicability. I'm not sure if Remote Storage is the right candidate for it, but it could be explored. Beyond those, I'd expect it'd be driven by community requests.
Looking at #13274, we've got the same goals and would benefit from the same implementation. Currently leaning toward the interface in this comment: #13274 (comment). Not all plugins will need all interfaces implemented, so we'll need to consider default implementations or an abstract base class with no-op implementations (see the sketch below).
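As a rough illustration of that idea, here is a minimal sketch; `AbstractMetadataClient` is a hypothetical name, and the `Client` and `PutCustom` types mirror the `XContentClient` example that follows:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

/**
 * Hypothetical base class: subclasses override only the operations their
 * backing store supports; anything else fails fast with a clear error.
 */
public abstract class AbstractMetadataClient implements Client {

    @Override
    public CompletionStage<PutCustomResponse> putCustom(PutCustomRequest request) {
        return unsupported("putCustom");
    }

    // Other CRUD methods would default the same way.

    private static <T> CompletionStage<T> unsupported(String operation) {
        CompletableFuture<T> future = new CompletableFuture<>();
        future.completeExceptionally(
            new UnsupportedOperationException(operation + " is not supported by this metadata store")
        );
        return future;
    }
}
```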
An implementation using XContent:

```java
// Import locations shown here are an assumption and may vary across OpenSearch versions.
import static org.opensearch.action.support.WriteRequest.RefreshPolicy.IMMEDIATE;

import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

import org.opensearch.action.DocWriteResponse.Result;
import org.opensearch.action.index.IndexRequest;
import org.opensearch.common.xcontent.XContentFactory;
import org.opensearch.core.action.ActionListener;
import org.opensearch.core.xcontent.ToXContent;
import org.opensearch.core.xcontent.XContentBuilder;

public class XContentClient implements Client {

    private final org.opensearch.client.Client client;

    public XContentClient(org.opensearch.client.Client client) {
        this.client = client;
    }

    @Override
    public CompletionStage<PutCustomResponse> putCustom(PutCustomRequest request) {
        CompletableFuture<PutCustomResponse> future = new CompletableFuture<>();
        try (XContentBuilder sourceBuilder = XContentFactory.jsonBuilder()) {
            // Serialize the custom object to JSON and index it, completing the
            // future when the asynchronous index action responds.
            client.index(
                new IndexRequest(request.index()).setRefreshPolicy(IMMEDIATE)
                    .source(request.custom().toXContent(sourceBuilder, ToXContent.EMPTY_PARAMS)),
                ActionListener.wrap(
                    r -> future.complete(
                        new PutCustomResponse.Builder().id(r.getId()).created(Result.CREATED.equals(r.getResult())).build()
                    ),
                    future::completeExceptionally
                )
            );
        } catch (IOException ioe) {
            // XContent serialization failed before the request was sent
            future.completeExceptionally(ioe);
        }
        return future;
    }
}
```
@dbwiddis What is the basic idea for plugins to migrate from their existing solution to using this? Will data movement be required (i.e., read all data from cluster state or the system index being used today and rewrite it into this interface), or can this be a new facade on top of the existing data?

Regarding existing plugins migrating to this and the questions around search, have you done an inventory of the current plugins that exist within the OpenSearch project and their usage of metadata storage? Basically, I would tend to agree with your statement "plugin metadata tends to be more narrowly defined and include more key-value storage or a limited number of documents primarily referenced by their document ID", but it would be great if we could back that assertion up with some real numbers.
Funny you should ask, as I was working on this draft PR, which I hope shows a possible migration path. Open to feedback! opensearch-project/ml-commons#2430

The above PR maintains the data in the existing system index; no movement required. However, my next trick will be to use the OpenSearch Java Client to demonstrate the ability to read/write that same data to a remote cluster, and my previous POC would allow that same data to be read/written to a NoSQL store. Options abound. TL;DR: for the existing cluster implementation, it's a facade.

Broadly, yes, but I have not done a detailed look; for now I'm focusing on Flow Framework and ML Commons.

I'll definitely add that to my long to-do list!
Just to clarify, should the plugins manage the persistent storage alternatives? Or should plugins just talk with the DAO interface and not care what the persistent storage is? (I hope the latter.) Or are you saying that the DAO implementations could be plugins? (I hope so.)
"The plugins" here is overly broad, as plugins will still have a choice in the storage alternative, however that will be more of a configuration-level "management". The vision here is:
Maybe. Probably. I'm not sure yet. That's why this RFC. But this is conceptually similar to the At this point, the DAO implementations are a single class, so creating a whole plugin around them feels like overkill. They probably do at least need to be in separate maven artifacts. |
@msfroh @dbwiddis I think the telemetry effort might be a good parallel here:

- There is a TelemetryPlugin interface that allows injecting an implementation for transmitting/storing telemetry data to whatever other system you want.
- The server defines interfaces for emitting metrics (e.g. Tracer, MetricsRegistry) and wires up the implementation provided via TelemetryPlugin. Core pieces of the server now emit metrics through those interfaces.
- Finally, a TelemetryAwarePlugin interface was introduced to expose Tracer and MetricsRegistry to plugins so that they can emit metrics themselves.

So following that pattern, a new plugin would allow injecting an implementation for storing metadata into the core. The default could be cluster state and/or system indices, but a plugin could just as easily provide an implementation for an external store. The core would define some new interface for reading/writing this metadata (I think this would look something like your Dao class) and wire up the appropriate implementation. Core features (such as search pipelines) would use this interface for reading and writing their metadata. And finally, this new interface would be exposed to plugins for their own metadata storage needs.

One final point: @dbwiddis is coming at this from the perspective of plugin extensibility, but I believe this work also aligns with breaking apart the monolithic cluster state, which is an impediment to scaling large clusters.
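To make the analogy concrete, here is a minimal sketch of what that could look like; `MetadataStorePlugin`, `MetadataStoreAwarePlugin`, and `MetadataStore` are hypothetical names, not existing OpenSearch interfaces:

```java
import org.opensearch.common.settings.Settings;

// Each interface would live in its own file; shown together for brevity.

/** Hypothetical storage abstraction; see the Dao/Client interface elsewhere in this issue. */
interface MetadataStore { /* CRUD operations keyed by document id */ }

/**
 * Hypothetical analogue of TelemetryPlugin: one installable plugin supplies
 * the storage implementation, and the core wires it up for everyone else.
 */
interface MetadataStorePlugin {
    MetadataStore getMetadataStore(Settings settings);
}

/**
 * Hypothetical analogue of TelemetryAwarePlugin: plugins receive the configured
 * store without knowing whether it is backed by cluster state, a system index,
 * or an external system.
 */
interface MetadataStoreAwarePlugin {
    void setMetadataStore(MetadataStore metadataStore);
}
```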
There are more use cases, like Remote Store using a metadata store, and we really need to think about the store's capabilities, for instance optimistic concurrency control, conditional updates, and transactions, which are missing in the `Repository` interface used today.
@andrross -- Yes! That's exactly the approach that I was thinking of -- pluggable metadata storage, where the producers/consumers don't care how it works or where the metadata is stored, just that it honors some interface.

@Bukhtawar -- I don't think @dbwiddis is suggesting that the `Repository` interface be used for this.
@msfroh Thanks! I believe @Bukhtawar is saying that the current Remote Store implementation uses the `Repository` interface for storing metadata, and that it is a bad fit for that use case. So Remote Store is yet another feature that would be interested in using the new metadata store. But just as there are questions around what search features the metadata store should support, we'd need to figure out what transactional and concurrency-control features it should support as well.
Very interesting idea indeed. But if any (meta)data will be stored in external systems, then I think we also need to think about the identity management that needs to be part of the communication with external stores. Should this topic be part of this proposal as well?

The simplest example would be just a basic read/write permissions configuration for an external store (assuming the external system requires user authn/authz). Will that be part of the OpenSearch configuration? Or specific configuration of the particular metadata store implementation? Or will this be left to external systems to handle (i.e., users would need to set up a proxy server in front of the external store)?
@lukas-vlcek I think identity management would work similarly to how the repository plugins work in OpenSearch. Each implementation for any given remote metadata storage system would be responsible for defining how its specific credentials are provided, and the operator would be responsible for giving credentials with the necessary permissions for OpenSearch to interact with this system. |
@andrross @lukas-vlcek @Bukhtawar Circling back to this after several weeks of trying to migrate a plugin to use this. I'm a bit concerned that we may be trying to over-generalize "metadata" and store arbitrary things. We actually do have arbitrary blob storage in various locations with an interface. But:

- system indices
- cluster state

These two abstractions already store very specific types of data and have very different interfaces. System indices store documents just like all of OpenSearch, and plugins expect the usual CRUD-S operations to work on them just like they always do. Sure, we can put that document (conceptually a JSON string) anywhere. Cluster state is completely different in how it operates, but it's very consistent with a different interface. It's difficult to combine them.
Is your feature request related to a problem? Please describe
OpenSearch plugins require persistent storage of some configuration and metadata related to their operation, which is logically distinct from end-user data storage. Currently, plugins use either cluster state or system indices to manage this information.
Unlike typical data designed for optimized search at large scale, plugin metadata tends to be more narrowly defined and leans toward key-value storage or a limited number of documents primarily referenced by their document ID. Plugins should be able to define persistent storage alternatives other than cluster state or system indices. Options include remote clusters, non-relational (NoSQL) databases (MongoDB, DynamoDB, Apache Cassandra, HBase, and many others), and blob storage (see Remote Cluster State RFC #9143).
However, the plugin code should ideally stay the same, with multiple alternate implementations of a common interface providing the logical separation of code from data.
Defining a standard Data Access Object (DAO) interface pattern can provide a path for plugins to migrate existing code with a cluster-based implementation, while providing future flexibility for using those plugins with other storage locations.
Note: the scope of this proposal is for key-value or id-based lookups. It specifically excludes data that requires efficient searching.
Describe the solution you'd like
Here is an example interface proposal:
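A minimal sketch, assuming the `CompletionStage`-based style and the `PutCustom` request/response names used in the implementation examples in the comments above; the `GetCustom` and `DeleteCustom` methods are assumptions that mirror `putCustom`:

```java
import java.util.concurrent.CompletionStage;

/**
 * Sketch of a DAO-style client interface for plugin metadata. All operations
 * are asynchronous and keyed by document id; search is intentionally excluded.
 */
public interface Client {
    /** Create or update a metadata document. */
    CompletionStage<PutCustomResponse> putCustom(PutCustomRequest request);

    /** Fetch a metadata document by id (hypothetical, mirroring putCustom). */
    CompletionStage<GetCustomResponse> getCustom(GetCustomRequest request);

    /** Delete a metadata document by id (hypothetical, mirroring putCustom). */
    CompletionStage<DeleteCustomResponse> deleteCustom(DeleteCustomRequest request);
}
```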
Existing NodeClient implementations in plugins could be migrated to an XContent-based implementation such as the `XContentClient` shown in the comments above (a partially implemented example exists for Flow Framework).
A remote cluster using the OpenSearch Java Client could use this same style of interface; a partial implementation is sketched below:
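This sketch assumes the hypothetical `Client` and `PutCustom` types above and uses the asynchronous API of the `opensearch-java` client; `RemoteClusterClient` is an illustrative name:

```java
import java.util.concurrent.CompletionStage;

import org.opensearch.client.opensearch.OpenSearchAsyncClient;
import org.opensearch.client.opensearch._types.Result;

public class RemoteClusterClient implements Client {

    private final OpenSearchAsyncClient client;

    public RemoteClusterClient(OpenSearchAsyncClient client) {
        this.client = client;
    }

    @Override
    public CompletionStage<PutCustomResponse> putCustom(PutCustomRequest request) {
        // The java client serializes the document with its configured JSON mapper,
        // so no explicit XContent handling is needed here.
        return client.index(i -> i.index(request.index()).document(request.custom()))
            .thenApply(
                r -> new PutCustomResponse.Builder()
                    .id(r.id())
                    .created(Result.Created.equals(r.result()))
                    .build()
            );
    }
}
```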
CRUD implementations for NoSQL data stores would use the appropriate request/response model for each store, which largely aligns with the interface. For example:
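Here is a minimal DynamoDB sketch using the AWS SDK v2; the table layout, attribute names, and id-generation strategy are assumptions, as is treating `request.custom()` as serializable to a JSON string:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletionStage;

import software.amazon.awssdk.services.dynamodb.DynamoDbAsyncClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class DynamoDbMetadataClient implements Client {

    private final DynamoDbAsyncClient client;

    public DynamoDbMetadataClient(DynamoDbAsyncClient client) {
        this.client = client;
    }

    @Override
    public CompletionStage<PutCustomResponse> putCustom(PutCustomRequest request) {
        // Assumption: the store generates the id, as an index request without an id would.
        String id = UUID.randomUUID().toString();
        PutItemRequest putItem = PutItemRequest.builder()
            .tableName(request.index()) // the "index" name maps to a table name
            .item(Map.of(
                "id", AttributeValue.builder().s(id).build(),
                // Assumption: the custom object can render itself as a JSON string
                "source", AttributeValue.builder().s(request.custom().toString()).build()))
            .build();
        return client.putItem(putItem)
            .thenApply(r -> new PutCustomResponse.Builder().id(id).created(true).build());
    }
}
```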
Related component
Plugins
Describe alternatives you've considered
Let each plugin define its own interface
This will likely be an intermediate solution, further explored while this RFC is open for comments. It may even be a reasonable long-term solution, as some plugins might require more arguments than this limited proposal provides. However, it would cause some duplication of effort and result in more copy/paste implementations.
Define the interface and base implementations either in core/common or a new repository to be used as a dependency
Given the small number of classes, this seems like a lot of overhead compared to including them in an existing common dependency.
Wrapper Client
I explored using a wrapper client implementing the `NodeClient` for CRUD operations in this proof-of-concept PR: opensearch-project/flow-framework#631. This allowed a minimum of changes in the plugin code, easing migration. However, it also required translating the NodeClient Request objects into the specific implementation's Request objects (easy) and generating NodeClient Response objects from the implementation's Response objects (which required creating non-relevant data related to shards). The proposed interface minimizes this extra object creation.
Inline Code
Each call site interacting with the index could have multiple conditional implementations. This complexity mixes code and data and makes migration difficult.
Additional context
This proposal is intentionally narrowly scoped to key-value or id-based CRUD operations and excludes search operations.
The only code proposed to be added to OpenSearch itself is the interface definition and possibly an abstract base class with some common NodeClient implementations. It could reside in any package commonly loaded by plugins.