Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggested operations for moving raft shards across replicas in cluster (v4) #377

Open
danthegoodman1 opened this issue Nov 23, 2024 · 7 comments

Comments

@danthegoodman1
Copy link

danthegoodman1 commented Nov 23, 2024

I'm playing with using multi-raft in a range-partitioned DB, and one thing I am working on in particular is being able to move raft shards around the cluster to balance load. I have a couple questions, and looking for any general advice on the topic, specifically for:

  1. Moving raft shards among the cluster
  2. Transferring the leader to the intended node
  3. Moving raft shards back to nodes that were previously moved away from

Moving raft shards

My assumption for moving raft shards is to:

  1. Join new replicas to the shard as voting replicas (e.g. join 3 more to the initial 3 in a worst-case move)
  2. Once all new replicas have successfully joined, start deleting the initial replicas
  3. Once all old replicas are removed, NodeHost.RequestLeaderTransfer on the intended new leader (least loaded replica)

Is this correct? Or is there a better way to do partial/full replica movement?

Transferring leadership

I notice that for NodeHost.RequestLeaderTransfer it is stated There is no guarantee that such request can be fulfilled, and is not also something you can wait a channel for like other Request* operations. I'm assuming this is not guaranteed because it would not be able to select the replica that's not as caught up as other replicas? Maybe some other reasons.

And so, how would you suggest that Transferring leadership is done when a specific shard is being moved because of high load for the overall replica?

Maybe something like https://github.com/lni/dragonboat/blob/master/nodehost_test.go#L1709 could be used to wait for leadership change, or time out and error/warn log (or retry)?

Moving back to nodes that were previously removed

Additionally, I notice that for NodeHost.RequestDeleteReplica it is stated Once a node is successfully deleted from a Raft shard, it will not be allowed to be added back to the shard with the same node identity.. Would the suggestion then be that node:shard relationships be dynamic? Otherwise this shard could never be moved back on to nodes that it previously lived on. I don't think a dynamic node name seems natural, but maybe it could be encoded like {count of how many times that shard has been on this node id} * 100_000 + {real node id}, and then the cluster is technically limited to 100k nodes? Or should a mapping of real node -> shard replica ID be tracked in a meta shard?

@danthegoodman1 danthegoodman1 changed the title Suggested operations for moving raft shards across clusters (v4) Suggested operations for moving raft shards across replicas in cluster (v4) Nov 23, 2024
@danthegoodman1
Copy link
Author

https://github.com/lni/dragonboat/blob/master/nodehost_test.go#L2621-L2634 seems to conflict with the testing practices from the README as well

@lni
Copy link
Owner

lni commented Nov 24, 2024

  1. move raft shards

you may want to join new replicas as non-voting member first. this allow them some time to catch up with the progress of your shard without impacting the the underlying consensus mechanisms. once they are up to speed (e.g. a ReadIndex read can be successfully completed on those non-voting members), you can promote them to voting members.

leadership transfer can be optional, unless you require a certain replica to be the leader

  1. leadership transfer

there are many exceptions that can happen during the leadership transfer, such request is thus always considered as following a best effort approach. you can wait some time and then check who is the leader now. again, such check doesn't mean too much, as the leadership can just change any time after your check is done (before you get the check result).

note that such issues are typical distributed system problems, there is simply no such strong guarantees similar to those you'd be expecting from a single machine system.

  1. Moving back to nodes that were previously removed

you are free to do whatever you want regarding those replica IDs, the only limitation here is not to have any duplication. with the replica ID value being uint64, a random value should be good enough assuming it is indeed reasonably random.

@kevburnsjr
Copy link
Contributor

kevburnsjr commented Nov 24, 2024

  1. Transferring leadership

Rather than actively polling, you could pass a RaftEventListener in your NodeHostConfig. Each host's RaftEventListener will be automatically notified of all leadership changes on all shards for which the host has an active replica. From my testing, this hook is highly reliable.

  1. Moving back to nodes that were previously removed

I think there's a typo in the documentation for RequestDeleteReplica mixing v3 and v4 nomenclature that may be causing some confusion. In v4 nomenclature it should read:

"Once a replica is successfully deleted from a Raft shard, it will not be allowed to be added back to the shard with the same replica identity."

  • Each host can have at most one active replica per shard
  • Once a replica is removed, it cannot be re-added
  • A new replica must be created for the host

To re-join a host to a shard after its replica has been removed you just need to call RequestAddReplica from an active host with a new (unique) replica id and your new host as the target.

@danthegoodman1
Copy link
Author

Thanks for the details folks! For a random replica id I guess that would mean it now has to be persisted since it’s decoupled from a static node id and they need to be able to restart with the same replica id

@lni
Copy link
Owner

lni commented Nov 25, 2024

@danthegoodman1

from experience, quite often a production system will have a separate devops sub-system to keep monitoring those involved nodes and start repairing shards when any replica is dead. by doing that, it will have to somehow remember all replica IDs anyway.

@kevburnsjr
Copy link
Contributor

kevburnsjr commented Nov 25, 2024

No need to persist the replica ID out of band. It is persisted by the host.
Restarting a node/host looks like this:

  1. Instantiate the NodeHost
  2. Call GetNodeHostInfo
  3. Iterate returned ShardInfoList
  4. StartReplicas

@danthegoodman1
Copy link
Author

Oh fantastic, thanks for sharing!

from experience, quite often a production system will have a separate devops sub-system to keep monitoring those involved nodes and start repairing shards when any replica is dead. by doing that, it will have to somehow remember all replica IDs anyway.

Yes I was planning on having a "meta shard" that tracks the existence and range ownership of other groups (and makes decisions for balancing shards among the cluster, node join, etc.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants