[RFC] Reader and Writer Separation in OpenSearch #7258
Tagging @nknize @andrross @dblock @sohami @Bukhtawar @muralikpbhat for feedback
I love the goals in this proposal. Separating reads and writes is a huge step forward, and creating a node role that's purely compute for search makes a lot of sense. Each search has a fetch and a reduce phase. If I understand correctly, you are proposing to create lagging replicas and move the fetch part to them. The reduce part would continue to be executed as it is today, with no changes. The node coordinating the search would continue to query all the shards via the search-replicas and then order the query results. Questions:
@dblock : Thanks for the feedback. You brought up valuable points that were not captured in detail.
Yes, you are right. The reduce phase would run on the coordinator. Today, the coordinator role can run on dedicated nodes or be co-located with data. The same logic would apply here, where the coordinator role can be co-located with the data and search roles. This raises the question of whether we can separate the coordinators for search and indexing as well. Yes, definitely; I had kept it out of the discussion for now, but it makes sense to capture it, and I will update the proposal. One way to solve this is to introduce a thin proxy in OpenSearch which only handles routing the traffic to the indexing/search coordinator. These coordinators could be dedicated or co-located with the data or search role. The user can choose to run this request routing (using the thin proxy) outside OpenSearch as well and forward requests to the appropriate coordinators.
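To make the thin-proxy idea concrete, here is a minimal plain-Java sketch. All names (ThinRequestProxy, RequestKind, the coordinator endpoints) are hypothetical illustrations, not existing OpenSearch classes; it only shows the routing decision, not transport or discovery.

```java
// Illustrative sketch only: the proxy inspects just the request type and forwards
// it to the matching coordinator pool, round-robin within that pool.
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class ThinRequestProxy {
    enum RequestKind { INDEXING, SEARCH }

    private final List<String> indexingCoordinators;
    private final List<String> searchCoordinators;
    private final AtomicLong counter = new AtomicLong();

    public ThinRequestProxy(List<String> indexingCoordinators, List<String> searchCoordinators) {
        this.indexingCoordinators = indexingCoordinators;
        this.searchCoordinators = searchCoordinators;
    }

    /** Pick a coordinator endpoint for the request, round-robin within the chosen pool. */
    public String route(RequestKind kind) {
        List<String> pool = (kind == RequestKind.INDEXING) ? indexingCoordinators : searchCoordinators;
        return pool.get((int) (counter.getAndIncrement() % pool.size()));
    }

    public static void main(String[] args) {
        ThinRequestProxy proxy = new ThinRequestProxy(
            List.of("index-coord-1:9200"),
            List.of("search-coord-1:9200", "search-coord-2:9200"));
        System.out.println(proxy.route(RequestKind.SEARCH));   // -> search-coord-1:9200
        System.out.println(proxy.route(RequestKind.INDEXING)); // -> index-coord-1:9200
    }
}
```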
The reduce phase could be memory intensive, e.g. an aggregation query aggregating over a large number of buckets. Yes, we can definitely think about separating the coordinators as captured above.
Yes, that could be a configuration option where the user decides whether all search-replicas should always remain in sync (same as replicas today) or can lag for x duration. With this separation, we can offer that flexibility to the user. Supporting only a majority of them could be tricky: today the primary doesn't fail a write when replicas don't respond; it instead fails the unresponsive replica and acks the request. One of the motivations for not keeping search-replicas strictly in sync was to decouple indexing from search-replicas completely, so that indexing latencies are not impacted in case a search-replica can't write at the same pace. It also avoids aggressively failing search-replicas when they are not in sync with the primary.
The leader node will always track the search-replicas to ensure they are synced up, either strictly in sync with the primary or, say, within x duration of it. But there can't be redirection: in case any search-replica is out of date, it should go unassigned and the coordinator will not send any search requests to that search-replica.
What I like about the separation is that the architecture is effectively a Quorum Consensus one, and that's the direction we should move in. What I would think differently about is the following:
Node roles are becoming a sledgehammer for "separating" everything (
@nknize brings up very valid points on the explosion in the number of nodes. As a middle ground, would it make sense to be able to have fewer read-only replica nodes than data nodes? In the example of 5 data nodes, could a user bring up just 2 additional "search-replica" nodes to improve search performance? I think the answer is yes. But if we continue this thought, then we could dynamically mark compute-only nodes as "search-replica" and assign N nodes to M shards to replicate and be used for search based on needs at runtime.
Yes. However, the main point I'm making is that a node should be able to serve multiple roles at any given time and we shouldn't bias towards "extreme separation". As I mention in #6528 (comment), if the write load spikes then we start simple by using existing mechanisms such as
@shwetathareja What is the behavior here if a primary fails and the only remaining copy of the shard is on a search-replica (i.e. no remote-backed storage and no healthy write-replica exists)?
One of the proposed benefits here is that specialized hardware for a given role will allow a user to do more work with fewer resources. If hardware is tailored for search or indexing, then you'll need some mechanism to ensure that hardware only gets the workload it is optimized for. Any cloud user has enormous flexibility in choosing hardware configurations and could plausibly benefit from such an architecture. It would probably be super useful to have some prototype numbers to get real data on the potential for cost savings here.
Another proposed benefit is predictable performance, which is a little bit at odds with autotuning. It's great when your cluster can automatically adapt and optimize for the given workload, but it can make it challenging to know the limits of your system, when to scale what, how it will respond when overloaded, etc.
💯 Reproducibility becomes much more complicated and trace logging becomes much more important, although I think standard rules still apply. If, even with autotuning, you see pathological replication lag, you're still going to look at thread pools, memory and CPU pressure, along with the parameters at the time of query rejection. But I think it's here that our existing constraint mechanisms play a big part in cluster insights and the evasive actions taken. Take disk watermarks: if the low watermark is hit on the primary, in today's implementation you can expect the node to take aggressive invasive action and forbid new shard allocations on that node. It's at this point your autoscaling configuration matters, i.e., it could leverage its remote store weapons to take evasive action like the "roll off" scenario described in #6528 and invoke a "cache by query" mechanism that "swaps" local segments to warm storage to alleviate disk pressure on that node.
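As a rough illustration of the kind of watermark-style constraint described above (class name and thresholds are made up; this is not OpenSearch's actual allocation-decider implementation):

```java
// Illustrative watermark-style check: once disk usage crosses the low watermark the
// node stops accepting new shard allocations; crossing the high watermark is the
// point where evasive action (e.g. swapping segments to warm storage) would kick in.
public class DiskWatermarkCheck {
    private final double lowWatermark;   // e.g. 0.85 == 85% used
    private final double highWatermark;  // e.g. 0.90

    public DiskWatermarkCheck(double lowWatermark, double highWatermark) {
        this.lowWatermark = lowWatermark;
        this.highWatermark = highWatermark;
    }

    public boolean allowNewShardAllocations(long usedBytes, long totalBytes) {
        return ((double) usedBytes / totalBytes) < lowWatermark;
    }

    public boolean shouldTakeEvasiveAction(long usedBytes, long totalBytes) {
        return ((double) usedBytes / totalBytes) >= highWatermark;
    }
}
```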
Thanks @nknize for the feedback and for bringing up these interesting points around auto-tune and consensus-based reads/writes.
The key benefit of this "extreme separation" is isolation, and with different node roles it becomes easy to provide that isolation between the workloads. I agree there are too many node roles already. We can definitely look into resource isolation within the JVM for different operations like indexing/search, but that is still just an idea and we would need to evaluate how far we can go with it.
Today, customers also benchmark to find out their capacity in terms of data nodes. With indexing/search separation, they can plan their capacity better and can also use specialized hardware, which was not possible earlier.
I think we should design for giving flexibility to the user. The trade-offs are not the same for all users. There will be users who are looking for predictable performance in terms of latency and throughput and complete isolation of workloads.
I feel auto-tune is a separate problem from providing workload isolation. I love the idea of auto-tune in general; it would help users use their resources more optimally, as would using remote storage as swap space whenever nodes are in panic (due to low disk space) and having the system automatically adapt to such needs. But we can achieve auto-tune even where indexing and search are co-located and served from the same shard.
Isn't a Quorum Consensus architecture achievable without any separation at all, with just the "data" nodes of today and the same shard participating in the quorum for both read and write? Also, with remote store, consensus-based writes might become less attractive if the remote store is able to provide consistency and durability guarantees. In general, I feel we need to step back here: what is our goal with indexing/search separation? What are the customer use cases we are trying to solve? What are the future improvements which require this separation and are not possible with the current architecture, where indexing and search are co-located on the same node or the same shard serves both reads and writes?
Good point @andrross. In that case, ideally the write copies will turn red. Now, based on the in-sync allocations (checkpoint), the leader can identify whether it was up to date with the primary or not. In case it was up to date, it can be used to resurrect/promote to primary. If it was not up to date, then there would be data loss, and hence the user can manually choose to resurrect the primary from the search-replica or promote it, acknowledging the data loss.
First, I think we should empirically investigate what level of separation is really needed before making code changes to this end. I strongly feel we have most of the mechanisms in place to maximize cluster resources and throughput at minimal cost to the user before drawing a conclusion that node separation is really needed to this degree. So if we define "separation" (as I think this RFC does?) as separating the nodes through strict roles (vs separating
From this RFC I see a couple of mechanisms I think we should explore (without going so far as separating the roles):
Some of the other mechanisms mentioned are already being worked on (e.g., the streaming index API work is tuning buffers/caching during writes based on load).
I'd keep this simple. Scenario: indexing back pressure on the primary node trips some constraint jeopardizing indexing throughput. Contingency measure one:
Contingency measure two (adaptive shadow writing):
In this way we use existing idle nodes to scale writes. If Adaptive Selection indicates cluster red-lining, then we recommend the user add a new node. A cloud hosting environment could spin up a new node for them, but I'd put that logic in a
This also enables us to separate functional mechanisms from segrep into the core library such that cloud-native OpenSearch integrators could implement a pure serverless solution (separate from the "serverful"
Thinking deeper about this, if we implement the shadow writers by teasing out the segrep mechanisms (e.g., copySegment) and components to
@nknize 💯 to supporting indexing, search, and merge as stateless functions where each can scale independently of the others; that would be a big step towards enabling an OpenSearch serverless offering by different cloud providers. It requires more brainstorming on how it would work end to end, e.g. how failure handling would work if one shadow writer crashes, how the primary shard coordinates with the shadow writers that are creating segments and copying them back to the primary shard, whether it requires changes to the checkpointing mechanism, how updates and deletes are handled, etc. In general, I agree we should start from the end state where these functions are stateless, and work out how the management flow would look for serverless and serverful, what abstractions would be useful for both, and what should be in core vs. a cloud-native plugin/extension. My thoughts on your previous comments:
Thanks for sharing the add-back-search-preference discussion. This is exactly what we would have implemented anyway as part of this proposal. I think the RFC looks big because we are trying to explain the proposal in detail, but the overall changes would be on top of existing constructs like shard allocation filters, which would need changes to start differentiating primaries and replicas. Now, whether these changes should be part of core vs. a cloud-native plugin/extension is open for discussion. I totally agree with you that we should experiment and get data before we conclude in this direction, in terms of both performance and cost benefits. This RFC was to get initial buy-in on whether we should invest in this direction or not.
In case node resources are not exhausted and indexing backpressure is triggered, let's say due to the indexing queue, then parallel writers would help scale the indexing work and increase throughput. But if backpressure is applied due to JVM memory pressure/CPU/storage, then having a parallel writer on the same node wouldn't make it any better. +1 to adaptive shadow writing for scaling writes to other nodes. It will work for append-only workloads.
Mechanisms exist to handle these failure situations. Segment copy failures are handled with shadow writers the same way they are handled today in segrep when copying primary segments to replicas. Out-of-order delivery is handled through sequence numbers (and the primary term in case of primary failure). In segrep we still forward the op to the translog for failover, so that would remain in place. In the second use case the primary reports back to the coordinating node to route the next batch of operations to the shadow writer. Ordering is handled using the xlog sequence number, where the starting sequence number of the next batch of operations is sent to the shadow writer. Lucene also maintains sequence numbers so
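A tiny plain-Java sketch of the ordering idea, under the assumption (mine, not the comment's) that the primary reserves a contiguous sequence-number range per batch and hands the starting number to the shadow writer; BatchSequencer and the names in the usage comment are hypothetical.

```java
// Hypothetical sketch: each batch gets a contiguous sequence-number range; the
// starting sequence number of the next batch is what travels to the shadow writer,
// so global ordering can be preserved when the segments are reconciled later.
public class BatchSequencer {
    private long nextSeqNo = 0;

    /** Reserve seqNos for a batch of `size` operations and return the starting seqNo. */
    public synchronized long reserveBatch(int size) {
        long start = nextSeqNo;
        nextSeqNo += size;
        return start;
    }
}

// Usage (illustrative): the primary reserves a range, then routes the batch plus its
// starting seqNo to the shadow writer chosen by the coordinator, e.g.
//   long startSeqNo = sequencer.reserveBatch(bulkRequest.numberOfActions());
//   shadowWriter.write(bulkRequest, startSeqNo);
```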
ping @mattweber, are you still planning to commit the revert on this elastic PR?
Exactly! We can leverage Constraints the same way we do w/
Yes, I see value in bringing the change back while we brainstorm and converge on the long-term plan. One option I am wondering about: without supporting an explicit "search" node role (until we close on it), add a search preference for primary/replica, shard allocation filtering based on primary/replica, and better visibility in cluster health for replicas, so that users (be it a cloud provider or self-hosted) can start to separate the workloads and see performance benefits. These won't be big core changes. Do you see concerns with it, @nknize?
Agreed. When I read "node role" I imply that "node can have any combination of roles (or only one)". AFAIK, while some older roles were not designed this way, all new roles (e.g. dynamic roles) are. I think we're saying the same thing.
I believe that the goal is for users to be able to reason about performance, and scale indexing and search independently. For example, I'd like to be able to guarantee a stable level of quality of service for search that I can do math around with a calculator regardless of the indexing pattern: given XTB of data and a well understood search pattern, I want to reliably know that my cluster can perform N req/s with P100 of 10ms, therefore to support M users I need K search nodes while indexing may run in bursts. In this scenario I also would like to be able to measure traffic and start rejecting requests when the number is > N req/s to keep my QoS stable rather than degrade the search performance for everyone by peanut buttering the load across all search nodes.
You can do this through
Can you? You have a writer eating away CPU, pegging at 100% in bursts; that makes it very difficult to have predictable reader performance.
@shwetathareja I like the approach of using these low-level knobs to prove out the behavior and performance we're looking for. The third aspect (beyond search routing and shard allocation) is translog replication (both the actual replication used in pure segrep and the primary-term validation/no-op replication used with remote translog). Can toggling this behavior be inferred from the allocation filtering (i.e. a node that is excluded from being promoted to primary can skip any translog replication behavior), or would another knob be necessary?
@shwetathareja I see. With this "soft" separation the replica nodes can still be promoted to primary (and temporarily serve writes?) but will be quickly relocated to a node designated as the preferred location for primaries; hence the replicas still have to keep all the data and functionality required to act as a primary.
Adaptively add a ghost writer to a new node: I really like this approach, but I am nervous about the few guarantees it will break for existing APIs. In theory, this is similar to the hinted handoff used by key-value databases to improve availability. The indexing APIs are synchronous and guarantee that once a write is acknowledged, it will be persisted and will not fail with validation errors in the future. This is achieved by writing the documents to the Lucene buffer in the synchronous path of the request, which does the mapping validation implicitly. If we had a system where acknowledging the write did not require knowing the existing state of the system, this approach would work. But in reality this is not true, due to the API guarantees of OpenSearch rather than the limitations of the underlying system.
The ghost writer approach has value even for higher availability with remote-backed storage, where the translog replay to a replica could take longer. We need to identify the use cases or guarantees that it will break and explicitly handle those guarantees in the API itself. @nknize Thoughts?
@shwetathareja If you do complete isolation, where would operations like update_by_query and delete_by_query run? Even without autoscaling requirements, users will need reader/writer separation, so there is value to extreme isolation. E.g. I have a writer cluster that builds the index daily and a reader cluster which serves read queries and, once a day, gets all the new changes through a snapshot. The writer cluster has writer-specific configurations, e.g. a large Lucene buffer and a directory implementation optimized for writes. The reader cluster has instances with large memory to support fast read queries. With a remote-backed store, this use case can be perfectly automated for the user without having to maintain 2 clusters. I agree that building it in core may not be the best mechanism to achieve it, and we should explore other options as explained by @nknize
Thanks for putting together this proposal @shwetathareja. I like the discussion we are having so far. A couple of questions:
Thanks @itiyama for sharing your insights!
Any mutable action would still run on the write copies. The OpenSearch code remains the same; this proposal only tries to separate out purely search requests.
Totally agree; customers wouldn't need expensive solutions built with CCR (cross-cluster replication) if the same cluster can provide the isolation they are looking for.
@kkmr, thanks for your comments.
With local storage, all search-replicas would sync from the primary; with remote store, they can sync from the remote store directly. It would depend on how recoveries are configured for each storage type.
No, the primary wouldn't wait for search-replicas. They can lag behind the primary, but they would eventually be failed if they can't catch up within the configured thresholds. We would need to figure out an async mechanism to do this.
That's a good point. I think someone brought it up during the segrep discussion for local storage. I would see that as an optimization once we have the core design laid out.
Today, snapshots are taken from the primary, so that should remain the same.
Scroll and PIT shouldn't really be impacted; e.g. an active scroll context could prevent segment merges and would require keeping old data around. In general, segment replication should address these concerns, and they should be independent of this separation unless I am missing something.
Async queries should be served from search-replicas.
You could think of this as dynamic sharding without the rehashing and rebalancing of a permanent resharding. Permanent resharding may still be necessary in the case of shard hotspots or full cluster red-lining.
It puts the burden of providing conflict resolution on the reducer, not the user. Lucene provides two
Right. Neither case would require a shard-level routing table. Only the second case would require a cluster-level reroute based on selecting the node with the most available resources. To explore this I started playing around with a scaffolding for adding a new OperationRouting to the existing routing logic, based on a new AdaptiveWriteStats mechanism that looks like ARS but includes IndexingPressure heuristics. These changes are not rote refactors; they do require some new interesting logic, and the overall fault tolerance will require a few new mechanisms for handling failover of ghost writers (e.g., in on-prem peer-to-peer cluster mode, ghost writers would still forward write operations to replicas for xlog durability in the event a ghost writer fails - just like existing primary writes).
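As a toy illustration of what an ARS-plus-indexing-pressure heuristic could look like (the weights, record fields, and class names are invented for this sketch and are not the actual AdaptiveWriteStats being prototyped):

```java
// Toy illustration only: rank candidate nodes for a ghost writer by combining write
// queue depth, write latency, and indexing-pressure bytes. Weights are arbitrary.
import java.util.Comparator;
import java.util.List;

record WriteNodeStats(String nodeId, int writeQueueSize, double avgWriteLatencyMillis,
                      long indexingPressureBytes) {}

public class AdaptiveWriteSelector {
    /** Lower score == less loaded == better candidate for shadow writes. */
    static double score(WriteNodeStats s) {
        return s.writeQueueSize() * 10.0
                + s.avgWriteLatencyMillis()
                + s.indexingPressureBytes() / 1_000_000.0;
    }

    static String pickLeastLoaded(List<WriteNodeStats> candidates) {
        return candidates.stream()
                .min(Comparator.comparingDouble(AdaptiveWriteSelector::score))
                .map(WriteNodeStats::nodeId)
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<WriteNodeStats> nodes = List.of(
            new WriteNodeStats("node-a", 40, 120.0, 80_000_000L),
            new WriteNodeStats("node-b", 2, 15.0, 1_000_000L));
        System.out.println(pickLeastLoaded(nodes)); // -> node-b
    }
}
```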
We would not blindly ack the request, and the approach still works. The ack mechanism would not change. A single Index API call will ack after the call.
No. With the LeafReader reducer and IndexWriter.addIndexes it's not necessary. A GhostWriter is just a new, empty IndexWriter for the primary shard. There are no existing segments on disk for a GhostWriter node, so think of it as an empty primary shard. Once the primary is overloaded and its write pool is lagged/exhausted, we simply route all write operations for that primary (from that point forward) to an underused node and begin writing to a whole new set of segments. Once that batch is complete we
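To make the addIndexes step concrete, here is a minimal Lucene sketch (paths, the analyzer choice, and the reconcile timing are illustrative; it assumes lucene-core plus the analysis-common module on the classpath). A ghost writer produces segments in its own directory, and the primary later folds them in at the segment level:

```java
import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class GhostWriterMergeSketch {
    public static void main(String[] args) throws IOException {
        // "Ghost" writer: an empty IndexWriter on an underused node, writing a fresh
        // set of segments for the overloaded primary shard.
        try (Directory ghostDir = FSDirectory.open(Path.of("/tmp/ghost-segments"));
             IndexWriter ghostWriter = new IndexWriter(ghostDir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("_id", "42", Field.Store.YES));
            ghostWriter.addDocument(doc);
            ghostWriter.commit();
        }

        // Later, the primary folds the ghost segments back into its own index.
        try (Directory primaryDir = FSDirectory.open(Path.of("/tmp/primary-shard"));
             Directory ghostDir = FSDirectory.open(Path.of("/tmp/ghost-segments"));
             IndexWriter primaryWriter = new IndexWriter(primaryDir, new IndexWriterConfig(new StandardAnalyzer()))) {
            primaryWriter.addIndexes(ghostDir);   // segment-level merge-in, no re-analysis of docs
            primaryWriter.commit();
        }
    }
}
```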
Today
Exactly. That's how it's done today. Segment files are refcounted, so they are protected from deletion until all PITs complete and the refcount drops to 0. This affects things like the disk watermark, since it can cause disk pressure. This is why a remote-store-backed index is useful for scaling storage.
I think I'll tease out the parallel-write conversation (Ghost Writer) from the node-separation conversation (this issue) by opening a separate issue. We can keep this RFC scoped to just the node role scenario. If we decide to implement node separation as described here, then let's work on that in a separate plugin (call it
My point here is that the ack is not a simple ack for an index request. We return detailed exceptions to our customers when a request cannot complete. Customers may be taking an action when we do not send them an exception message. Let me explain with a few examples:
I may be missing something here. If you could explain how we will keep the experience consistent with the current state for the above cases, it would be great.
I was wondering whether you are trying to get around the limitation of static sharding in OpenSearch, which could be solved by just changing the doc routing strategy using simple principles of consistent hashing. Then you need not do rehashing across shards and you effectively get the same benefits. Additional metadata on the hash-space ranges is maintained per shard in the cluster metadata. The downside is that the re-sharding approach splits the shard first by deleting the documents from each half and then merges them later, as opposed to your approach of just doing the merging later. The cost of creating a new writer is minimal, which is what you need when traffic increases suddenly. Ghost sharding is better unless there is a conflict-resolution downside that we haven't uncovered yet. The ghost writer is also better as it avoids a lot of refactoring; the assumption that the shard count is static for an index is everywhere in our code. I have a few more questions on the reducer; I will ask them once I get a reply to this.
If the GhostWriter is empty and unaware of the existing documents, then it can't execute updates: a user can update only a few fields and keep the other values the same. Also, to prevent conflicts during updates, the internal _version or a user-provided version is used for optimistic concurrency control. How would the reducer merge these conflicts? Today, OpenSearch doesn't guarantee the order of update requests, but they are applied one after the other rather than overwriting each other completely. Initially, I was wondering if this GhostWriter/ShadowWriter is proposed only for append-only uses.
I feel primary/replica separation in core would benefit any kind of user (on-prem or cloud provider). In terms of extreme separation, we can review what should be in core vs. a plugin. I will think more on this and get back.
The reducer would have to "lazily" reconcile the update. The challenge it poses is when to ack, so we would need to think a bit about that. A simple phase 0 with minimally invasive changes might be to "best effort" ghost-write the update requests on replicas only. I have some other thoughts and will put them in a separate design discussion issue.
AFAIK, this is possible as long as the ack to the user does not really mean that the update will be applied. Conflict resolution within a single document can be done for some update cases as well; it would be like the famous shopping cart example of merging carts. If a user added 2 different fields, both can be added. If a user updated an existing field, the one updated later will be applied, for the contingency where the ghost is adaptively moved to a different node. For parallel ghost writers it would be a bad experience, so I think we should just use the ghost writer as a mechanism to move the shard quickly to a different node, to start with.
Background:
Today in OpenSearch, indexing and search are supported via the same "data" role. This causes the indexing and search workloads to adversely impact each other: an expensive query can take up all the memory and CPU, causing indexing requests to fail, and vice versa. On one side we have log analytics customers who want to scale their indexing to millions of docs per second, while on the other side there are search customers who configure a large number of replicas (e.g. 20) to support their search traffic; what they both really want is to scale for their respective workloads. But currently, OpenSearch doesn't support isolating indexing and search workloads.
Goal:
Support reader and writer separation in OpenSearch to provide predictable performance for indexing and search workloads, so that a sudden surge in one doesn't impact the other. This document talks about how we can achieve this separation in OpenSearch.
Terminology:
Benefits:
Proposal
In order to achieve indexing and search separation, we would build on the new node role "search". With searchable snapshots, work has already started on the "search" role for searching read-only indices from snapshots (#4689); this RFC will expand on that effort. The "search" node role would act as dedicated search nodes. The term "replica" would split in the core logic into "write-replica" and "search-replica". The write-replica/primary would be assigned to nodes with the "data" role, whereas the search-replica would be assigned to nodes with the "search" role.
Request Handling and Routing
The primary copy or write-replica can support both reads and writes, and this would be the default configuration where the user has not configured any "search-replica" or "search" nodes explicitly. The user would have control at the request level to decide whether the request should be handled by a "search-replica" only, or fall back to the primary or "write-replica" in case a search-replica is not available; the latter would be the default mode.
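A minimal plain-Java sketch of the fallback behavior described above, using hypothetical types (ShardCopy, SearchRequestRouter); the real decision would presumably live in the routing layer, but this only illustrates the choice between a search-replica and the fallback to a write copy.

```java
// Illustrative only: route a search to an active search-replica when one exists;
// otherwise (default mode) fall back to the primary / write-replica.
import java.util.List;
import java.util.Optional;

record ShardCopy(String nodeId, boolean isSearchReplica, boolean active) {}

public class SearchRequestRouter {
    /** strictSearchReplica == true means the user opted out of the fallback. */
    public Optional<ShardCopy> selectCopy(List<ShardCopy> copies, boolean strictSearchReplica) {
        Optional<ShardCopy> searchReplica = copies.stream()
                .filter(c -> c.active() && c.isSearchReplica())
                .findFirst();
        if (searchReplica.isPresent() || strictSearchReplica) {
            return searchReplica;           // may be empty -> request fails or retries
        }
        return copies.stream()              // default mode: fall back to a write copy
                .filter(c -> c.active() && !c.isSearchReplica())
                .findFirst();
    }
}
```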
Let's take a look at the flow for a bulk/search request (RequestRouting):
Search role
The search role on the nodes can be configured in 2 ways:
For #2, the proposal is to go with 2.a, where we don't allow the same node to have both the search and data roles, as that would defeat the purpose of introducing this separation in the first place. Also, if you think about it, 2.b essentially becomes the current model where any shard can land on any data node and serve both reads and writes.
This separation would enable users to think about their indexing and search workloads independently of each other and plan their capacity accordingly, e.g. search nodes can be added/removed on demand without impacting indexing throughput, and vice versa.
Characteristics of different shard copies
Cluster State changes
Another important aspect is shard allocation. With the new "search" role and the different replica types, the shard allocation logic will also need changes to accommodate them.
From the leader's perspective when assigning shards:
Today, the leader maintains the list of up-to-date copies for a shard in the cluster state as "in-sync allocations". With reader/writer separation, search-replicas can be more flexible and can lag further behind the primary compared to write-replicas. There should be a mechanism for the leader to track search-replicas and fail them in case they are not able to catch up within the configured thresholds.
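One possible shape for that mechanism, sketched in plain Java with hypothetical names and a made-up ops-based lag threshold (the real threshold could equally be time-based):

```java
// Hypothetical sketch: the leader periodically compares each search-replica's
// last-applied checkpoint against the primary's and reports copies that have
// fallen beyond the configured threshold so they can be failed/unassigned.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SearchReplicaLagMonitor {
    private final long maxAllowedLagOps;

    public SearchReplicaLagMonitor(long maxAllowedLagOps) {
        this.maxAllowedLagOps = maxAllowedLagOps;
    }

    /** Returns the allocation ids of search-replicas that should be failed. */
    public List<String> findStaleCopies(long primaryCheckpoint, Map<String, Long> searchReplicaCheckpoints) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> e : searchReplicaCheckpoints.entrySet()) {
            if (primaryCheckpoint - e.getValue() > maxAllowedLagOps) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }
}
```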
Index (Cluster) Status - Green/ Yellow/ Red
Today in OpenSearch, if all the copies of a shard are assigned, the index status is green. If any of the replicas is unassigned it turns yellow, and if all the copies (including the primary) are unassigned it turns red. This works well when a single copy can serve both functions, i.e. indexing and search. With reader/writer separation, red would be a panic signal only for writable copies, but it wouldn't indicate whether all the read copies (search-replicas) are unassigned. We would need a way to indicate read and write copy health separately.
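A small sketch of how read and write health could be reported separately; the status rules and names here are illustrative, not a settled design.

```java
// Illustrative: compute write health (primary + write-replicas) and read health
// (search-replicas) independently instead of a single combined index status.
public class SplitIndexHealth {
    public enum Health { GREEN, YELLOW, RED }

    public static Health writeHealth(boolean primaryAssigned, int assignedWriteReplicas, int configuredWriteReplicas) {
        if (!primaryAssigned) return Health.RED;
        return assignedWriteReplicas < configuredWriteReplicas ? Health.YELLOW : Health.GREEN;
    }

    public static Health readHealth(int assignedSearchReplicas, int configuredSearchReplicas) {
        if (configuredSearchReplicas > 0 && assignedSearchReplicas == 0) return Health.RED;
        return assignedSearchReplicas < configuredSearchReplicas ? Health.YELLOW : Health.GREEN;
    }
}
```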
User configurations
Why do we need to configure search-replicas with a new setting?
The current "replica" setting is overloaded to act as "write-replica" in the case of role separation. There is no way to configure search-replicas without adding a new setting. The current replica setting could have meant search-replica instead of write-replica, but that would complicate the default configuration where the user has not configured nodes with the "search" role and the replicas would go unassigned.
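Assuming the proposal surfaced as a new index setting, it might look something like the following sketch; index.number_of_search_replicas is a hypothetical name here, while index.number_of_replicas is the existing setting that would act as the write-replica count.

```java
import org.opensearch.common.settings.Settings;

public class SearchReplicaSettingsSketch {
    public static void main(String[] args) {
        Settings indexSettings = Settings.builder()
            .put("index.number_of_replicas", 1)          // existing setting -> write-replicas
            .put("index.number_of_search_replicas", 2)   // proposed/hypothetical new setting
            .build();
        System.out.println(indexSettings.get("index.number_of_search_replicas")); // prints 2
    }
}
```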
How to turn off primary shard copy in case search-replicas are enabled?
With reader/writer separation, the user can choose to turn off the primary and set write-replicas to 0 in case they don't expect any writes to that index; e.g. a log analytics customer, after rolling over an index, can set all write copies to 0 for the old index and only configure "search-replica" as needed. Today, OpenSearch doesn't offer any setting to turn off the primary. This could be an explicit setting, or it could be derived implicitly when an index is marked read-only.
Tuning of buffers/ caching based on workload
There are various node-level buffers and caching thresholds that are configured assuming the same node will serve both indexing and search workloads. Those could be fine-tuned better with this separation.
Auto management with ISM
The Index State Management plugin will simplify some of these aspects for the user, so they don't need to configure the different replica types explicitly and ISM policies can take care of it; e.g. during migration of an index from hot to warm, it can set the primary and write-replicas to 0 and set search-replicas to n, as configured in the policy.
A note on the underlying storage
Ideally, reader and writer separation is independent of the underlying storage (local or remote) and the index replication algorithm (document or segment replication). But current document replication with local storage can't offer true isolation between writer and reader, as both node types (data & search) would do the same amount of indexing work. There is no concrete plan yet for remote store with document replication. So in the first phase, segment-replication-enabled indices will be the ones that benefit from this reader/writer separation.
Separating Coordinators
Today, the coordinator role handles the responsibility of coordinating both indexing and search requests. The above proposal doesn't yet cover separating the coordinators in OpenSearch; I will create a separate issue to discuss coordinator role separation in detail.
Thanks @dblock for calling this out so it can be captured separately. Regarding the discussion, check the comment
Future Improvements
Reader and writer separation would lay the groundwork for many more improvements in the future, like auto-scaling of read and write copies, partial fetching via block-level fetch, or fetching a segment/shard on demand only when there is read or write traffic, etc. In the future we can also look into moving segment merges to dedicated nodes. Some of these improvements are also discussed here - #6528
How can you help?
Any feedback on the overall proposal is welcome. If you have specific requirements/ use cases around indexing and search separation which are not addressed by the above proposal, please let us know.