From a031af74a4e0fd53749afe5383a8e4a4468244eb Mon Sep 17 00:00:00 2001
From: joshvanl
Date: Fri, 20 Sep 2024 09:02:20 +0100
Subject: [PATCH 1/7] Scheduler Failure Policy

Signed-off-by: joshvanl
---
 20240919-scheduler-failure-policy.md | 118 +++++++++++++++++++++++++++
 1 file changed, 118 insertions(+)
 create mode 100644 20240919-scheduler-failure-policy.md

diff --git a/20240919-scheduler-failure-policy.md b/20240919-scheduler-failure-policy.md
new file mode 100644
index 0000000..4e1f7d1
--- /dev/null
+++ b/20240919-scheduler-failure-policy.md
@@ -0,0 +1,118 @@
+# Scheduler Job Failure Policy
+
+* Author(s): Josh van Leeuwen (@joshvanl)
+* Updated: 2024-09-19
+
+## Overview
+
+Proposal details a Scheduler queue and Job API extension to support controlling behaviour of Job triggering in the event of failure.
+
+## Background
+
+The [Scheduler](https://docs.dapr.io/concepts/dapr-services/scheduler/) (and [go-etcd-cron](https://github.com/diagridio/go-etcd-cron/)) library are responsible for managing and executing jobs of all target consuming types.
+When a Job is triggered, it is sent on a gRPC streaming connection from Scheduler to a connected daprd that implements that [Job target](https://github.com/dapr/dapr/blob/da6fb0db46b4d2932640eeaaaccf8b76f248f388/dapr/proto/scheduler/v1/scheduler.proto#L115).
+In the event that this fails, for example if the trigger itself fails or there are no daprd instances connected for that target, the Job will currently still be marked as "triggered" a.k.a. "ticked" on the queue backend.
+While always ticking failed jobs can be desirable behaviour, this is not always the case- and the applications often requires the job trigger to be retried multiple times on that tick to ensure durability of the schedule.
+
+## Expectations and alternatives
+
+1. The [go-etcd-cron](https://github.com/diagridio/go-etcd-cron/) library is to be updated to support a new `FailurePolicy` mechanism to correctly re-schedule jobs in the event of trigger failure.
+2. The Scheduler Job API will mirror the new options available in the cron library.
+   The scheduler will correctly signal to the library in the event of a trigger failure.
+3. The runtime Jobs API will be updated to support the new `FailurePolicy` options.
+4. The workflow Actor Reminders will be updated so that they will be marked as only to be triggered once, with an appropriate retry failure policy when using Scheduler.
+
+
+## Design
+
+To begin with, we will support 3 failure policies:
+1. `Drop`: the job trigger will not be retried and the job will be marked as ticked.
+   This is the current behaviour and will continue to be the default behaviour for all jobs.
+2. `Constant`: the job will be retried as a constant time interval, up to a maximum number of retries (which could be infinite).
+3. `Schedule`: the job will be retried according to a [cron scheudler](https://github.com/diagridio/go-etcd-cron/blob/2a1c6747974627691165eb96a2ca0202285d71eb/proto/job.proto#L68), up to a maximum number of retries (which could be infinite).
+
+### go-etcd-cron
+
+To support the new `FailurePolicy` options, the go-etcd-cron, we first need to keep track of the current number of attempts the current Job count has been triggered on a particular tick.
+We also need to keep track of the time at which that attempt was made in order to calculate the next attempt time according to the `FailurePolicy` schedule.
+Like the trigger time, the last attempt time is the virtual _correct_ time that the attempt was to be made, rather than the observed wall clock time.
+This is to ensure durability of the Scheduler during events of slow down or down time/restarts.
+The attempts and last attempt time will added to the Counter proto message, and managed by the trigger queue manager.
+Attempts will be 0 and last attempted time will be `nil` for all new tick counts of a Job.
+
+```proto
+// Counter holds counter information for a given job.
+message Counter {
+  ...
+
+  // attempts is the number of times the job has been attempted to be
+  // triggered at this count.
+  uint32 attempts = 4;
+
+  // last_attempted is the timestamp the job was last attempted to be triggered.
+  optional google.protobuf.Timestamp last_attempted = 5;
+}
+```
+
+Below are the proto definitions for the new `FailurePolicy` options.
+Both constant and cron policies include an optional max retries option to limit the number of retries according to the number of attempts.
+The failure policy message is added as an optional field to the Job message.
+If unset, the failure policy of a Job is `Drop`.
+
+```proto
+// FailurePolicy defines the policy to apply when a job fails to trigger.
+message FailurePolicy {
+  // policy is the policy to apply when a job fails to trigger.
+  oneof policy {
+    FailurePolicyDrop drop = 1;
+    FailurePolicyConstant constant = 2;
+    FailurePolicyCron cron = 3;
+  }
+}
+
+// FailurePolicyDrop is a policy which drops the job tick when the job fails to
+// trigger.
+message FailurePolicyDrop {}
+
+// FailurePolicyRetry is a policy which retries the job at a consistent
+// delay when the job fails to trigger.
+message FailurePolicyConstant {
+  // delay is the constant delay to wait before retrying the job.
+  google.protobuf.Duration delay = 1;
+
+  // max_retries is the optional maximum number of retries to attempt before
+  // giving up.
+  optional uint32 max_retries = 2;
+}
+
+// FailurePolicyCron is a policy which retries the job according to a cron
+// schedule.
+message FailurePolicyCron {
+  // schedule is the cron schedule at which to retry the job.
+  // See the Job.schedule field for the format of schedule.
+  // Must not be empty.
+  string schedule = 1;
+
+  // max_retries is the optional maximum number of retries to attempt before
+  // giving up.
+  optional uint32 max_retries = 2;
+}
+```
+
+```proto
+message Job {
+  ...
+
+  // failure_policy is the optional policy to apply when a job fails to
+  // trigger.
+  // By default, the failure policy is drop- meaning the Job tick will be
+  // dropped or ignored in the event of failure.
+  optional FailurePolicy failure_policy = 7;
+}
+```
+
+## Dapr
+
+The Scheduler Job API service mirrors the new `FailurePolicy` options available in the go-etcd-cron library.
+Similarly, the runtime Jobs API will be updated to support the same new `FailurePolicy` options.
+When using Scheduler, Workflows will change Actor Reminders to now be single shot Jobs with a constant failure policy of every 5 seconds and a maximum attempts of 120 (10 minutes).

From 32b756277ffef3cbccba22668cd71a384ab0396b Mon Sep 17 00:00:00 2001
From: joshvanl
Date: Fri, 20 Sep 2024 09:08:12 +0100
Subject: [PATCH 2/7] Formatting

Signed-off-by: joshvanl
---
 20240919-scheduler-failure-policy.md | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/20240919-scheduler-failure-policy.md b/20240919-scheduler-failure-policy.md
index 4e1f7d1..33e059c 100644
--- a/20240919-scheduler-failure-policy.md
+++ b/20240919-scheduler-failure-policy.md
@@ -9,10 +9,10 @@ Proposal details a Scheduler queue and Job API extension to support controlling
 
 ## Background
 
-The [Scheduler](https://docs.dapr.io/concepts/dapr-services/scheduler/) (and [go-etcd-cron](https://github.com/diagridio/go-etcd-cron/)) library are responsible for managing and executing jobs of all target consuming types.
+The [Scheduler](https://docs.dapr.io/concepts/dapr-services/scheduler/) (and [go-etcd-cron](https://github.com/diagridio/go-etcd-cron/) library) are responsible for managing and executing jobs of all target consuming types.
 When a Job is triggered, it is sent on a gRPC streaming connection from Scheduler to a connected daprd that implements that [Job target](https://github.com/dapr/dapr/blob/da6fb0db46b4d2932640eeaaaccf8b76f248f388/dapr/proto/scheduler/v1/scheduler.proto#L115).
 In the event that this fails, for example if the trigger itself fails or there are no daprd instances connected for that target, the Job will currently still be marked as "triggered" a.k.a. "ticked" on the queue backend.
-While always ticking failed jobs can be desirable behaviour, this is not always the case- and the applications often requires the job trigger to be retried multiple times on that tick to ensure durability of the schedule.
+While always ticking failed jobs can be desirable behaviour, this is not always the case- and applications often require the job trigger to be retried multiple times on that tick to ensure durability of the schedule.
 
 ## Expectations and alternatives
 
@@ -33,7 +33,7 @@ To begin with, we will support 3 failure policies:
 
 ### go-etcd-cron
 
-To support the new `FailurePolicy` options, the go-etcd-cron, we first need to keep track of the current number of attempts the current Job count has been triggered on a particular tick.
+To support the new `FailurePolicy` options, we first need to keep track of the current number of attempts the current Job count has been triggered on a particular tick.
 We also need to keep track of the time at which that attempt was made in order to calculate the next attempt time according to the `FailurePolicy` schedule.
 Like the trigger time, the last attempt time is the virtual _correct_ time that the attempt was to be made, rather than the observed wall clock time.
 This is to ensure durability of the Scheduler during events of slow down or down time/restarts.
@@ -56,6 +56,7 @@ message Counter {
 
 Below are the proto definitions for the new `FailurePolicy` options.
 Both constant and cron policies include an optional max retries option to limit the number of retries according to the number of attempts.
+If max retries is unset, the Job will be retried indefinitely.
 The failure policy message is added as an optional field to the Job message.
 If unset, the failure policy of a Job is `Drop`.
 
@@ -82,6 +83,7 @@ message FailurePolicyConstant {
 
   // max_retries is the optional maximum number of retries to attempt before
   // giving up.
+  // If unset, the Job will be retried indefinitely.
   optional uint32 max_retries = 2;
 }
 
@@ -95,6 +97,7 @@ message FailurePolicyCron {
 
   // max_retries is the optional maximum number of retries to attempt before
   // giving up.
+  // If unset, the Job will be retried indefinitely.
   optional uint32 max_retries = 2;
 }
 ```
@@ -115,4 +118,4 @@ message Job {
 
 The Scheduler Job API service mirrors the new `FailurePolicy` options available in the go-etcd-cron library.
 Similarly, the runtime Jobs API will be updated to support the same new `FailurePolicy` options.
-When using Scheduler, Workflows will change Actor Reminders to now be single shot Jobs with a constant failure policy of every 5 seconds and a maximum attempts of 120 (10 minutes).
+When using Scheduler, Workflows will change Actor Reminders to now be single shot Jobs with a constant failure policy of every 5 seconds, and a maximum attempts of 120 (10 minutes).

From 43a5f4185dc59d754a897cf83ee4140add3e3df3 Mon Sep 17 00:00:00 2001
From: joshvanl
Date: Fri, 20 Sep 2024 09:22:30 +0100
Subject: [PATCH 3/7] Adds future design on staging queue

Signed-off-by: joshvanl
---
 20240919-scheduler-failure-policy.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/20240919-scheduler-failure-policy.md b/20240919-scheduler-failure-policy.md
index 33e059c..96ee7ae 100644
--- a/20240919-scheduler-failure-policy.md
+++ b/20240919-scheduler-failure-policy.md
@@ -31,6 +31,12 @@ To begin with, we will support 3 failure policies:
 2. `Constant`: the job will be retried as a constant time interval, up to a maximum number of retries (which could be infinite).
 3. `Schedule`: the job will be retried according to a [cron scheudler](https://github.com/diagridio/go-etcd-cron/blob/2a1c6747974627691165eb96a2ca0202285d71eb/proto/job.proto#L68), up to a maximum number of retries (which could be infinite).
 
+### Future Design
+
+Although not part of the proposal, in future we can extend Scheduler to include a staging queue which is dedicated for Jobs where no current stream implements its target.
+This addition would mean that these Jobs are not needlessly being attempted to be triggered, freeing up main queue resources and preserving the intended failure policy.
+In the event a stream implementing the target is connected, the Job can be moved to the main queue and triggered immediately.
+
 ### go-etcd-cron
 
 To support the new `FailurePolicy` options, we first need to keep track of the current number of attempts the current Job count has been triggered on a particular tick.

From 52f2bfb2eb84dd3abc37978ca64f67fe1b65822d Mon Sep 17 00:00:00 2001
From: joshvanl
Date: Fri, 20 Sep 2024 09:28:06 +0100
Subject: [PATCH 4/7] Expands staging queue

Signed-off-by: joshvanl
---
 20240919-scheduler-failure-policy.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/20240919-scheduler-failure-policy.md b/20240919-scheduler-failure-policy.md
index 96ee7ae..e3c0b99 100644
--- a/20240919-scheduler-failure-policy.md
+++ b/20240919-scheduler-failure-policy.md
@@ -34,8 +34,9 @@ To begin with, we will support 3 failure policies:
 ### Future Design
 
 Although not part of the proposal, in future we can extend Scheduler to include a staging queue which is dedicated for Jobs where no current stream implements its target.
-This addition would mean that these Jobs are not needlessly being attempted to be triggered, freeing up main queue resources and preserving the intended failure policy.
-In the event a stream implementing the target is connected, the Job can be moved to the main queue and triggered immediately.
+This addition would mean that the Jobs with no target are not needlessly being attempted to be triggered, freeing up main queue resources and preserving the intended failure policy.
+Upon trigger, if no target stream is connected, the job will be moved to this staging queue.
+In the event a stream implementing the target is connected, the Job can be moved to the main queue and triggered immediately or on its proper schedule.
 
 ### go-etcd-cron
 

From f131c4e445d4b243c8aa0bdc9234c0700593b018 Mon Sep 17 00:00:00 2001
From: joshvanl
Date: Fri, 20 Sep 2024 09:30:48 +0100
Subject: [PATCH 5/7] Adds notes to workflow actor reminders

Signed-off-by: joshvanl
---
 20240919-scheduler-failure-policy.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/20240919-scheduler-failure-policy.md b/20240919-scheduler-failure-policy.md
index e3c0b99..0d74b60 100644
--- a/20240919-scheduler-failure-policy.md
+++ b/20240919-scheduler-failure-policy.md
@@ -125,4 +125,11 @@ message Job {
 
 The Scheduler Job API service mirrors the new `FailurePolicy` options available in the go-etcd-cron library.
 Similarly, the runtime Jobs API will be updated to support the same new `FailurePolicy` options.
+
 When using Scheduler, Workflows will change Actor Reminders to now be single shot Jobs with a constant failure policy of every 5 seconds, and a maximum attempts of 120 (10 minutes).
+Making this change to workflow reminders is desirable for a number of reasons:
+
+- The current 1 minute interval of workflow reminders is often not appropriate, as it is either too long or too short.
+- Current implementation of "cancelling" a workflow reminder is fragile and often does not work as expected.
+- Removes the Delete Reminder code in the workflow runtime which is adding another round trip.
+- Ensures there is never a case of "double trigger" of a workflow reminder, which is a suspected current source of flakiness in workflows.

From cd7f48915e90d9e22e3b7b6c6e45e1067c078b00 Mon Sep 17 00:00:00 2001
From: Josh van Leeuwen
Date: Thu, 26 Sep 2024 12:59:56 +0100
Subject: [PATCH 6/7] Update 20240919-scheduler-failure-policy.md

Co-authored-by: Mike Nguyen
Signed-off-by: Josh van Leeuwen
---
 20240919-scheduler-failure-policy.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/20240919-scheduler-failure-policy.md b/20240919-scheduler-failure-policy.md
index 0d74b60..2d5096a 100644
--- a/20240919-scheduler-failure-policy.md
+++ b/20240919-scheduler-failure-policy.md
@@ -29,7 +29,7 @@ To begin with, we will support 3 failure policies:
 1. `Drop`: the job trigger will not be retried and the job will be marked as ticked.
    This is the current behaviour and will continue to be the default behaviour for all jobs.
 2. `Constant`: the job will be retried as a constant time interval, up to a maximum number of retries (which could be infinite).
-3. `Schedule`: the job will be retried according to a [cron scheudler](https://github.com/diagridio/go-etcd-cron/blob/2a1c6747974627691165eb96a2ca0202285d71eb/proto/job.proto#L68), up to a maximum number of retries (which could be infinite).
+3. `Schedule`: the job will be retried according to a [cron scheduler](https://github.com/diagridio/go-etcd-cron/blob/2a1c6747974627691165eb96a2ca0202285d71eb/proto/job.proto#L68), up to a maximum number of retries (which could be infinite).
 
 ### Future Design
 

From 26419398a54283c18d70a3c60288e9229a180416 Mon Sep 17 00:00:00 2001
From: joshvanl
Date: Wed, 2 Oct 2024 14:22:42 +0100
Subject: [PATCH 7/7] Remove FailurePolicyCron & change default to basic FailurePolicyConstant

Signed-off-by: joshvanl
---
 20240919-scheduler-failure-policy.md | 26 +++++---------------------
 1 file changed, 5 insertions(+), 21 deletions(-)

diff --git a/20240919-scheduler-failure-policy.md b/20240919-scheduler-failure-policy.md
index 2d5096a..d5683e9 100644
--- a/20240919-scheduler-failure-policy.md
+++ b/20240919-scheduler-failure-policy.md
@@ -25,11 +25,10 @@ While always ticking failed jobs can be desirable behaviour, this is not always
 
 ## Design
 
-To begin with, we will support 3 failure policies:
+To begin with, we will support 2 failure policies:
 1. `Drop`: the job trigger will not be retried and the job will be marked as ticked.
-   This is the current behaviour and will continue to be the default behaviour for all jobs.
 2. `Constant`: the job will be retried as a constant time interval, up to a maximum number of retries (which could be infinite).
-3. `Schedule`: the job will be retried according to a [cron scheduler](https://github.com/diagridio/go-etcd-cron/blob/2a1c6747974627691165eb96a2ca0202285d71eb/proto/job.proto#L68), up to a maximum number of retries (which could be infinite).
+   This policy will be the default behaviour, with a 1 second delay and 3 maximum retries.
 
 ### Future Design
 
@@ -62,7 +61,7 @@ message Counter {
 ```
 
 Below are the proto definitions for the new `FailurePolicy` options.
-Both constant and cron policies include an optional max retries option to limit the number of retries according to the number of attempts.
+The constant policy includes an optional max retries option to limit the number of retries according to the number of attempts.
 If max retries is unset, the Job will be retried indefinitely.
 The failure policy message is added as an optional field to the Job message.
 If unset, the failure policy of a Job is `Drop`.
@@ -74,7 +73,6 @@ message FailurePolicy {
   oneof policy {
     FailurePolicyDrop drop = 1;
     FailurePolicyConstant constant = 2;
-    FailurePolicyCron cron = 3;
   }
 }
 
@@ -93,20 +91,6 @@ message FailurePolicyConstant {
   // If unset, the Job will be retried indefinitely.
   optional uint32 max_retries = 2;
 }
-
-// FailurePolicyCron is a policy which retries the job according to a cron
-// schedule.
-message FailurePolicyCron {
-  // schedule is the cron schedule at which to retry the job.
-  // See the Job.schedule field for the format of schedule.
-  // Must not be empty.
-  string schedule = 1;
-
-  // max_retries is the optional maximum number of retries to attempt before
-  // giving up.
-  // If unset, the Job will be retried indefinitely.
-  optional uint32 max_retries = 2;
-}
 ```
 
 ```proto
@@ -115,8 +99,8 @@ message Job {
 
   // failure_policy is the optional policy to apply when a job fails to
   // trigger.
-  // By default, the failure policy is drop- meaning the Job tick will be
-  // dropped or ignored in the event of failure.
+  // By default, the failure policy is FailurePolicyConstant, with a 1s delay
+  // and 3 maximum retries.
   optional FailurePolicy failure_policy = 7;
 }
 ```