[RFC] Tiering/Migration of indices from hot to warm where warm indices are mutable #13294
Comments
Would love to get feedback from the community for this: @andrross @sohami @reta @mch2 @shwetathareja @Bukhtawar @rohin @ankitkala |
@neetikasinghal thanks for sharing the different design options. Nitpick: please use ClusterManager in place of Master terminology. Trying to understand Option 2, which is your preferred choice:
Does "dedicated" here refer to dedicated warm nodes? Ideally, the tiering logic shouldn't be tightly coupled with the cluster configuration, whether it has dedicated warm nodes or not. Dedicated warm nodes basically dictate the shard allocation logic: whether a shard stays on the current node post tiering or moves to another node because it is no longer eligible for the current one.
When is this accessed: only at the time of the migration status API call, or otherwise as well? |
thanks @shwetathareja for your feedback.
Yes dedicated refers to dedicated warm nodes.
Correct, this is accessed at the time of tiering status API. |
Focusing on the option 2:
Do you have any idea what the magnitude of space saving here would be? New Index settings themselves are already in the cluster state so it seems like this would be pretty close to a net 0 change. Also the in-memory metadata you are proposing under the "Cons" section seems like it would basically cancel out any savings here.
This might be too far into the details for the purposes of this issue, but it seems to me that you would need some sort of process to constantly iterate over all of the indices in order to monitor the migration status (which I understand is basically what option 3 is describing). At a high level, how would you know if something is not working for one of your state transitions, and how would you handle state transition failures? For example, if there is some sort of allocation decider issue preventing shard relocation, how would we handle that? Aside from this, it would be really helpful to add some reasons for why option 2 is preferred, since (at least for me) it's hard to understand the judgement call from just the list of pros/cons. |
Thanks for the proposal @neetikasinghal. A few questions:
How are we ensuring this with Option 2?
|
Storing the in-progress entries in a custom cluster state entry would also involve storing index metadata, at a minimum the index name and index UUID, which is already present in the index metadata and would be duplicated in the custom cluster state entry. Avoiding this could mean some space savings on one node; however, since the cluster state is replicated across all the nodes in the cluster, the savings would definitely be larger. Storing the metadata in memory is a suggested optimization, and that metadata is kept only on the master node and not on all the nodes, unlike a cluster state update.
TieringService would have the logic to deal with all the scenarios. Let me explain from design 2's perspective. For the non-dedicated setup, TieringService triggers an API call to the composite directory for the shard level updates, so if there is any failure, TieringService is aware of it and can handle it either by retrying or by marking the tiering as failed. For the dedicated node setup, TieringService listens to the shard relocation completion notification from each shard. TieringService knows about all the shards of a given index. If, say, one shard is not able to complete the relocation, TieringService can figure out the reason for the stuck relocation by calling the
The number of cluster state updates, saving some space as called out above, being able to retry failures in the non-dedicated setup, not losing accepted migration requests, not needing to determine a polling interval (as in options 3/4), and not overwhelming the cluster with too many get calls from the polling service make option 2 the preferred choice. These points are already called out in the pros/cons, though. Let me know if you have any specific questions regarding the pros/cons of any of the design choices. |
For each accepted request, a cluster state update is triggered that updates the index settings. The cluster state is replicated across all the master nodes, so based on the index settings, TieringService should be able to find all the accepted requests and reconstruct the lost metadata on a master flip/restart.
|
I would avoid a polling service to start with. I gave this a basic pass and am wondering how this is substantially different from a snapshot custom entry, where different shards undergo snapshots at various points? |
@neetikasinghal Thanks for the proposal. A couple of suggestions:
|
@Bukhtawar the proposed approach is not to have a custom cluster state entry, but rather to rely on the index settings (which are updated in the cluster state) and in-memory metadata on the master node to store the metadata of the shards. Please refer to design 2 for more details and let me know in case you have further questions. |
Is your feature request related to a problem? Please describe
This proposal presents the different design choices for tiering an index from hot to warm, where warm indices are writable, along with the preferred design choice. It is related to RFC #12809. The tiering APIs that provide the customer experience have already been discussed in #12501.
Describe the solution you'd like
Hot to Warm Tiering
API: POST /<indexNameOrPattern>/_tier/_warm
Response:
Failure:
There are two cases presented in each of the below design choices -
Case 1: Dedicated setup - cluster with dedicated warm nodes
Case 2: Non-Dedicated/Shared node setup - cluster without the dedicated warm nodes
DESIGN 1: Requests served via cluster manager node with cluster state entry and push notification from shards
In this design, a custom cluster state entry is introduced to store the metadata of the indices undergoing tiering.
In the non-dedicated setup, the hot-warm node listens to cluster state updates, and on detecting that index.store.data_locality (introduced in the PR) has changed from FULL to PARTIAL, it triggers the shard level updates on the composite directory. Allocation routing settings help relocate the shards from data nodes to dedicated search nodes. Setting the index locality value to PARTIAL helps initialize the shard on the dedicated node as a PARTIAL shard.
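The detection step in the non-dedicated case can be sketched as follows. This is a minimal illustration, assuming the node can compare the previous and current index settings on each cluster state update. The function name and the dict-based settings shape are hypothetical and are not OpenSearch APIs; only the index.store.data_locality key and its FULL and PARTIAL values come from this proposal.

```python
from typing import Dict, List

# Setting key taken from the proposal text.
DATA_LOCALITY = "index.store.data_locality"


def indices_switched_to_partial(
    previous: Dict[str, Dict[str, str]],
    current: Dict[str, Dict[str, str]],
) -> List[str]:
    """Return indices whose data locality changed from FULL to PARTIAL.

    Both arguments map index name -> flat index settings (hypothetical shape).
    Indices that are new or lack the setting in `previous` are not reported.
    """
    switched = []
    for index, settings in current.items():
        before = previous.get(index, {}).get(DATA_LOCALITY)
        after = settings.get(DATA_LOCALITY)
        if before == "FULL" and after == "PARTIAL":
            switched.append(index)
    return sorted(switched)
```

Each index returned here would then be a candidate for the shard level update on the composite directory.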
Pros
Cons
(Preferred) DESIGN 2: Requests served via cluster manager node with no cluster state entry and internal API call for shard level update
In this design, there is no custom cluster state entry stored in the cluster state.
In the dedicated warm nodes setup, TieringService adds allocation settings to the index (index.routing.allocation.require._role : search, index.routing.allocation.exclude._role : data) along with the other tiering related settings, as shown in the diagram below. When the re-route is triggered, the allocation deciders run and decide to relocate the shard from the hot node to the warm node. index.store.data_locality: PARTIAL helps initialize the shard in PARTIAL mode on the warm node during shard relocation.
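The settings named in this paragraph can be collected as in the sketch below. The function name is hypothetical, the flat dict is only an illustrative representation, and the "other tiering related settings" mentioned above are omitted because this issue does not list them.

```python
from typing import Dict


def dedicated_warm_settings() -> Dict[str, str]:
    """Settings the proposal names for tiering an index onto dedicated warm nodes."""
    return {
        # Require nodes with the search role and exclude data nodes,
        # so the allocation deciders relocate the shards on re-route.
        "index.routing.allocation.require._role": "search",
        "index.routing.allocation.exclude._role": "data",
        # Initialize the relocated shard in PARTIAL mode on the warm node.
        "index.store.data_locality": "PARTIAL",
    }
```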
In the non-dedicated setup, a new API (more details will be provided in a separate issue) is used to trigger the shard level update on the composite directory. This API can also be used for retrying in case of failures.
To track the shard level details of the tiering, the status API would give more insight into the relocation status in the dedicated nodes setup and into the shard level updates in the non-dedicated setup. More details on the status API will be covered in the follow-up issue.
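The per-index bookkeeping that this design relies on can be sketched as below. It assumes TieringService keeps shard progress in memory on the cluster manager node and either retries a failed shard or marks the tiering as failed. The class names, method names, and the max_attempts limit are all hypothetical and are not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class IndexTieringProgress:
    shard_ids: Set[int]
    completed: Set[int] = field(default_factory=set)
    attempts: Dict[int, int] = field(default_factory=dict)


class TieringTracker:
    """Hypothetical in-memory tracker for indices undergoing tiering."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts  # assumed retry limit
        self.in_progress: Dict[str, IndexTieringProgress] = {}

    def accept(self, index: str, shard_ids: Iterable[int]) -> None:
        """Start tracking an accepted tiering request."""
        self.in_progress[index] = IndexTieringProgress(set(shard_ids))

    def on_shard_success(self, index: str, shard_id: int) -> bool:
        """Record a completed shard; return True once every shard is done."""
        progress = self.in_progress[index]
        progress.completed.add(shard_id)
        if progress.completed == progress.shard_ids:
            del self.in_progress[index]
            return True
        return False

    def on_shard_failure(self, index: str, shard_id: int) -> str:
        """Return "RETRY" while attempts remain, else drop the index and return "FAILED"."""
        progress = self.in_progress[index]
        progress.attempts[shard_id] = progress.attempts.get(shard_id, 0) + 1
        if progress.attempts[shard_id] >= self.max_attempts:
            del self.in_progress[index]
            return "FAILED"
        return "RETRY"
```

Because this state lives only in memory, it would have to be rebuilt from the index settings after a cluster manager switch or restart.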
Pros
Cons
DESIGN 3: Requests served via cluster manager node with no Cluster state entry and Polling Service
In this design, instead of relying on notifications from the shards, a polling service runs periodically and checks the status of the shards of the indices undergoing tiering.
About Tiering Polling Service
The tiering polling service runs on the cluster manager node on a schedule, every x seconds as defined by the interval setting. A cluster-wide dynamic setting can configure the polling interval. The polling interval can differ between the dedicated and shared node setups, as migrations are expected to be faster in the shared node setup.
The polling service begins by checking whether there are any in-progress migrations. If there are none, it does nothing. If there are in-progress index migrations, it calls an API on the composite directory to check the status of each shard of those indices.
Once all the shards of an index succeed, the polling service updates the index settings of that index.
The caveat with this design is that, with multiple indices in the in-progress state, the polling service has to call the status API for all shards of the in-progress indices. This can include shards that had already succeeded in a previous run of the polling service; because one or two other shards were still relocating, the status check has to be redone in the next run. To optimize this, we can store some in-memory metadata that records the in-progress indices and the status of their shards. However, this metadata will be lost on a cluster manager switch or when the cluster manager goes down. Design 4 tries to deal with this limitation.
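One polling pass with the in-memory optimization could look like the sketch below. The function name, its parameters, and the shard_done callback (a stand-in for the status API on the composite directory) are hypothetical. Scheduling and the index settings update are left out.

```python
from typing import Callable, Dict, List, Set


def poll_once(
    in_progress: Dict[str, List[int]],
    shard_done: Callable[[str, int], bool],
    done_cache: Dict[str, Set[int]],
) -> List[str]:
    """Run one polling pass and return the indices whose shards all completed.

    in_progress maps index name -> shard ids still being tiered.
    done_cache remembers shards that succeeded in earlier passes so their
    status is not fetched again.
    """
    finished = []
    for index, shard_ids in in_progress.items():
        done = done_cache.setdefault(index, set())
        for shard_id in shard_ids:
            if shard_id in done:
                continue  # already succeeded in an earlier pass
            if shard_done(index, shard_id):
                done.add(shard_id)
        if all(shard_id in done for shard_id in shard_ids):
            finished.append(index)
    # Stop tracking finished indices; the caller would update their settings.
    for index in finished:
        del in_progress[index]
        del done_cache[index]
    return finished
```

If done_cache is lost, for example on a cluster manager switch, the next pass simply re-checks every shard, which is the re-computation cost described above.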
Pros
Cons
DESIGN 4: Requests served via cluster manager node with Cluster state entry and Polling Service
This design is similar to design 3, except that a custom cluster state entry stores the in-progress migrations. This removes the need to keep local metadata on the cluster manager node and avoids recomputing that metadata on a cluster manager switch.
Pros
Cons
Open Questions and next steps
Related component
Search:Remote Search