diff --git a/20240912-R-scheduler-auto-scale.md b/20240912-R-scheduler-auto-scale.md
new file mode 100644
index 0000000..8ffe6e0
--- /dev/null
+++ b/20240912-R-scheduler-auto-scale.md
@@ -0,0 +1,157 @@

# Auto-Scale the Scheduler service

* Author(s): Cassie Coyle (@cicoyle)
* State: Review
* Updated: 2024-09-12

## Overview

This proposal outlines a design for auto-scaling the Schedulers, which is needed for the Scheduler service to be Stable. The Scheduler's internal cron scheduler library code will need to be refactored to be dynamic. The runtime will need to be updated to dynamically connect to Schedulers as they are scaled up or down. The Scheduler server code will need to be updated so that it optionally runs embedded etcd on different Scheduler instances.

## Background

In Dapr v1.14 we introduced the Scheduler control plane service to support the Jobs API along with scaling the Actor reminder system and the Workflow API (since it uses Actor reminders under the hood). The Scheduler encompasses running itself as a new process, an embedded etcd instance, and an internal cron scheduler. This proposal details how to support auto-scaling of Schedulers and, secondly, how to enhance the Schedulers so that etcd instances run only in the Schedulers where it makes sense. Scaling beyond 3 etcd instances yields little performance improvement, so it makes sense for additional Schedulers to become stateless at that point. For example, we could have 20 Schedulers, but only 3 of them would contain the embedded etcd.

## Related Items

### Related issues

[Tracking Issue: Scheduler Stable](https://github.com/dapr/dapr/issues/7796) and
[Auto Scale Scheduler Issue](https://github.com/dapr/dapr/issues/7912).

## Expectations and alternatives

There will be 3 changes needed for this feature to be complete.

1. **New control loop** for the [go-etcd-cron library](https://github.com/diagridio/go-etcd-cron), so that the leadership table can be dynamic. This means we need to watch the leadership table to observe a new Scheduler starting up or going down, pausing all Schedulers' job triggers for that moment until the leadership table is updated and synchronized. We will need to stop the job queue and rebuild it by reshuffling all the jobs to evenly split job ownership amongst the Schedulers. Refactoring is needed here so that the queue logic can be stopped and restarted. The leadership table will be updated to additionally include the host address as the data, such that the key in the table remains the id, but the value becomes the proposed protobuf below, including the observed total replicas, host address, and port.

Leadership in this case refers to partition leadership, which is responsible for evenly dividing up the jobs amongst the available Schedulers. It is unrelated to etcd clustering or consensus. The Schedulers share job triggering responsibilities across the replicaset, based on the partition ID relative to the total number of partitions. Partition leadership ensures that no two Schedulers are working on the same nth partition, and that partitions are non-overlapping because they agree on the same observed total partition count.
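Purely to illustrate the shape of this control loop, below is a minimal Go sketch. It assumes a hypothetical `queue` interface with stop/rebuild semantics, a placeholder `leadership/` key prefix, and an elided `waitForSynchronizedTable` helper; the actual go-etcd-cron refactor will define its own types and names.

```go
// Hypothetical sketch of the dynamic leadership control loop; the real
// go-etcd-cron types, method names, and key prefixes will differ.
package cron

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// queue is a stand-in for the cron job queue, assuming it can be stopped and
// rebuilt once partition ownership changes.
type queue interface {
	Stop(ctx context.Context) error
	// Rebuild re-reads all jobs and keeps only those owned by this replica,
	// given its partition ID and the observed total number of partitions.
	Rebuild(ctx context.Context, partitionID, partitionTotal uint32) error
}

// watchLeadership pauses job triggering whenever the leadership table changes
// and resumes once the queue has been rebuilt against the new replica set.
func watchLeadership(ctx context.Context, cl *clientv3.Client, q queue, partitionID uint32) error {
	// "leadership/" is a placeholder prefix for the leadership table keys.
	ch := cl.Watch(ctx, "leadership/", clientv3.WithPrefix())
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ch:
			// A Scheduler came up or went down: stop triggering jobs until the
			// leadership table is synchronized again.
			if err := q.Stop(ctx); err != nil {
				return err
			}
			total, err := waitForSynchronizedTable(ctx, cl)
			if err != nil {
				return err
			}
			// Reshuffle: rebuild the queue so job ownership is split evenly
			// across the newly observed total number of partitions.
			if err := q.Rebuild(ctx, partitionID, total); err != nil {
				return err
			}
		}
	}
}

// waitForSynchronizedTable blocks until every row in the leadership table
// reports the same observed total, then returns that total (details elided).
func waitForSynchronizedTable(ctx context.Context, cl *clientv3.Client) (uint32, error) {
	// Implementation intentionally omitted in this sketch.
	_ = cl
	return 0, nil
}
```

The important property is that triggering is paused before the reshuffle, so no two replicas can trigger the same job while ownership is in flux.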
In a new `leadership.proto` file in the `go-etcd-cron` library we will have the following to dynamically track hosts available for job ownership distribution.

```protobuf
// Leadership is the message written to the leadership table when the replica
// gains ownership of the leader key.
message Leadership {
  // total is this replica's understanding of the total number of partition
  // replicas.
  uint32 total = 1;

  // leader_data is custom data that is associated with this leader. All leader
  // data will be sent to the library consumer on leadership table updates.
  optional google.protobuf.Any leader_data = 2;
}
```

Initially we were storing the partition total as a string in etcd, but with this design we will store the above protobuf, serialized to a string due to an etcd requirement. This approach will **not** break users: the library does a string comparison, it does not panic, and this data in particular is ephemeral anyway.

By introducing this optional field, we make the library extensible and useful for other use cases beyond the Scheduler, while ensuring a non-breaking, backwards-compatible change.

On the Scheduler side, we'll implement the following change in a new `dapr/proto/internals/scheduler.proto`:

```protobuf
// SchedulerPartitionOwner is the data which is written to the cron leadership
// table.
message SchedulerPartitionOwner {
  // host is the host address of this partition owner.
  string host = 1;

  // port is the port of this partition owner.
  uint32 port = 2;
}
```

Upon startup, each Scheduler will retrieve its own host and port and pass them to the cron library via the cron library `Options` here: https://github.com/dapr/dapr/blob/master/pkg/scheduler/server/server.go#L193. This information will feed into the newly proposed control loop in the cron library, ensuring that as Schedulers scale up or down, the job triggering load is evenly distributed among them.

- An optimization we can make after the initial work is done is to keep all jobs in memory. A local cache would make rebuilding the queue faster, and since we are already using etcd this should be feasible.

2. **A new WatchHosts RPC** is added to `scheduler.proto`, allowing the Dapr sidecar to connect to any available Scheduler instance via a ClusterIP service, rather than needing to manually track and connect to each instance as in the current headless service model. This `WatchHosts` RPC dynamically provides the current set of active Scheduler instances to the sidecar, keeping it updated as Schedulers scale up or down.

- For **backward compatibility**, we will check whether the `WatchHosts` RPC is available. If it exists, Dapr will use it to manage Scheduler connections dynamically. If not, Dapr will continue using the existing static connection logic, maintaining compatibility with earlier versions by sending back the full list of available Schedulers. A sketch of this sidecar-side logic follows the proto below.

```protobuf
service Scheduler {
  ...
  rpc WatchHosts(WatchHostsRequest) returns (stream WatchHostsResponse) {}
}

message HostPort {
  string host = 1;
  uint32 port = 2;
}

message WatchHostsResponse {
  repeated HostPort hosts = 1;
}

message WatchHostsRequest {}
```
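As an illustration of the backward-compatible fallback, the sidecar-side logic could look roughly like the sketch below. The `schedulerClient` and `hostStream` interfaces are stand-ins for the code protoc would generate, `watchSchedulerHosts` and `reconnect` are hypothetical names, and detecting an older Scheduler via a gRPC `Unimplemented` status is one possible mechanism, not a settled design decision.

```go
// Hypothetical sketch of sidecar-side WatchHosts handling with a fallback to
// the existing static connection logic for older Schedulers.
package client

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// HostPort mirrors the HostPort message in the proposed proto.
type HostPort struct {
	Host string
	Port uint32
}

// hostStream is a stand-in for the streaming client protoc would generate.
type hostStream interface {
	Recv() ([]HostPort, error)
}

// schedulerClient is a stand-in for the generated Scheduler gRPC client.
type schedulerClient interface {
	WatchHosts(ctx context.Context) (hostStream, error)
}

// watchSchedulerHosts keeps the sidecar's Scheduler connections in sync with
// the hosts streamed by WatchHosts, falling back to the statically configured
// addresses when the Scheduler predates the RPC.
func watchSchedulerHosts(ctx context.Context, cl schedulerClient, static []string, reconnect func([]HostPort)) error {
	stream, err := cl.WatchHosts(ctx)
	if err != nil {
		return err
	}
	for {
		hosts, err := stream.Recv()
		if status.Code(err) == codes.Unimplemented {
			// Older Scheduler without WatchHosts: keep the existing static
			// connection behavior (port parsing elided in this sketch).
			fallback := make([]HostPort, 0, len(static))
			for _, addr := range static {
				fallback = append(fallback, HostPort{Host: addr})
			}
			reconnect(fallback)
			return nil
		}
		if err != nil {
			return err
		}
		// Schedulers scaled up or down: rebuild connections to the new set.
		reconnect(hosts)
	}
}
```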
3. **1 or 3 Schedulers run with embedded etcd instances; beyond 3 instances, Schedulers run stateless**. We want to optionally run embedded etcd in the Scheduler for the cases where it makes sense. Cases to consider:
- running 1 Scheduler means running it with embedded etcd
- running 2 Schedulers means running the first with embedded etcd; the second Scheduler will be stateless and connect to the existing etcd
- running 3 Schedulers means running all with embedded etcd
- running 4 Schedulers means running 3 Schedulers with embedded etcd, and the last one will be stateless and connect to an existing etcd. Note: there is no gain in round-robining between the existing etcd instances due to the way etcd works with raft under the hood.
- running 5 Schedulers means running 3 Schedulers with embedded etcd; the last two Scheduler instances will be stateless
- running beyond 5 Schedulers means running 3 Schedulers with embedded etcd, and any Scheduler beyond 3 will be stateless and connect to an existing etcd instance

![Stateless Schedulers Example](./resources/20240912-R-scheduler-auto-scale/statelessSchedulersExample.png)

etcd must run with an odd number of replicas. In Dapr, 3 will be our practical limit while running inside Schedulers. With the cron scheduler library and embedded etcd already running as separate processes in the Scheduler, we just need to make running the embedded etcd optional.

Regarding [etcd's cluster reconfiguration operations guide](https://etcd.io/docs/v3.3/op-guide/runtime-configuration/#cluster-reconfiguration-operations), when adding members (etcd instances) we will need to set `initial-cluster-state` to `existing` for the new instances. We do not need to update the configuration of existing members manually: once the new members join the cluster, they are automatically integrated into the consensus and the cluster's state will include them.

As noted in their documentation, we _**should**_ set the `strict-reconfig-check` flag on etcd so that reconfiguration requests are rejected if the number of started members would be less than a quorum. This flag will ensure we avoid quorum loss as we dynamically add or remove stateful Schedulers.

## Completion Checklist

- [ ] Write the new control loop for the go-etcd-cron library to make the leadership table & job queue dynamic
- [ ] Add WatchHosts to scheduler.proto & update code to use it while ensuring backwards compatibility
- [ ] Implement the 1 or 3 Scheduler embedded etcd behavior, where Schedulers beyond 3 run stateless
- [ ] Tests
- [ ] Document that some Schedulers might be stateless
\ No newline at end of file

diff --git a/resources/20240912-R-scheduler-auto-scale/statelessSchedulersExample.png b/resources/20240912-R-scheduler-auto-scale/statelessSchedulersExample.png
new file mode 100644
index 0000000..56920db
Binary files a/resources/20240912-R-scheduler-auto-scale/statelessSchedulersExample.png and b/resources/20240912-R-scheduler-auto-scale/statelessSchedulersExample.png differ

diff --git a/templates/proposal.md b/templates/proposal.md
index e5c03b7..503c884 100644
--- a/templates/proposal.md
+++ b/templates/proposal.md
@@ -13,7 +13,7 @@ A brief description of the proposal; include information such as:
 ## Background
 
-This section is intented to provide the community with the reasoning behind this proposal -- why is this proposal being made? What problem is it solving for users / developers / operators and how does it solve that for them?
+This section is intended to provide the community with the reasoning behind this proposal -- why is this proposal being made? What problem is it solving for users / developers / operators and how does it solve that for them?
 
 ## Related Items