Add component healthcheck api design #34
Signed-off-by: Deepanshu Agarwal <[email protected]>
2. It should not give any false positives or false negatives.

## Current Scenario
There are many components in Dapr which don't yet implement Ping.
Comment from Alessandro (@ItalyPaleAle): I don't know if making Ping mandatory is needed. A lot of components are stateless (for example, they don't maintain persistent connections with a remote service). IMHO it's fine to include Ping in an optional interface.
Yes, it is optional right now, and that is the correct state in my opinion too. The doc also doesn't recommend making it mandatory.
Comment from Alessandro (@ItalyPaleAle ): That said, we should review components that don't implement Ping, and see if adding it would be useful.
Agreed
0010-R-components-healthcheck.md
Outdated

## Open Questions
1. Is it only for kubernetes users? Is it only needed for http endpoint? Or we should cover gRPC endpoint as well?
Comment from Alessandro (@ItalyPaleAle ): No, although K8s is arguably the main user of this.
Comment from Alessandro (@ItalyPaleAle ): Which brings up a question: should we add an annotation to indicate which components to include in the K8s healthchecks?
I'm not sure components should participate at all in K8s health checks. In k8s (and other systems) the health check removes a service from the endpoint collection for a K8s service, which means if a component is even temporarily down this might cause service invocation / actor interactions to stop working which could cause unplanned downtime and outages. Dapr health should be separated from component health.
Comment from Alessandro (@ItalyPaleAle ): We don't have a gRPC healthcheck endpoint today. Not sure if that's needed?
0010-R-components-healthcheck.md
Outdated
] | ||
} | ||
``` | ||
**Case 3:** When SOME components in the system implement Ping, some components DON'T implement Ping, AND some components have failed the Ping check: here we report the failed components as well as the components for which Ping is not implemented. We don't treat this API as a way to list components.
Comment from Alessandro (@ItalyPaleAle ): I am not sure we need a case for this. As per my other comment, IMHO it's totally fine for a component to NOT have Ping implemented. I think the healthcheck should invoke Ping when available, and otherwise just skip the component
But won't it give false positives then? As a user, how am I supposed to know which components implement Ping, and that only those components are covered by this health check?
Signed-off-by: Deepanshu Agarwal <[email protected]>
0010-R-components-healthcheck.md
Outdated
### Approach:
- Maintain a cache with the status of all successfully loaded components, and keep updating this cache in a background goroutine at a configurable `pingUpdateFrequency`. By default, `pingUpdateFrequency` is 5 minutes.

- This cache will not start being built right at the boot of the daprd sidecar. There will be a flag (let's say `collectPings`), which will be `false` when the daprd sidecar starts and will be turned `true` once all the components are ready.
How are these options passed? Are they in the Configuration CRD?
`collectPings` is an internal flag.
For `pingUpdateFrequency`, yes, the Configuration CRD would be the place.
0010-R-components-healthcheck.md
Outdated
http://localhost:3500/v1.0/healthz?include_components=true

### Approach:
- Maintain a cache with the status of all successfully loaded components, and keep updating this cache in a background goroutine at a configurable `pingUpdateFrequency`. By default, `pingUpdateFrequency` is 5 minutes.
Some thoughts:
- 5 mins seems a lot? That would mean the time to detect a failure can be as high as 5 mins. Maybe that's just my opinion.
- Ping updates may be more frequent if the component is un-healthy, since we want to see if we can recover faster
- I agree that 5 minutes seems high. I have updated it to 30 seconds.
- I like this point, but do we need another configuration for this? One more config would be too much, so I instead added logic so that, when a component is unhealthy, it will be pinged every `pingUpdateFrequency` / 3.
If un-healthy we should use an exponential backoff (with a limit of `pingUpdateFrequency`) IMHO, to avoid overloading a service.
Before I put too much thought into the API design, I want to get the purpose straight in my head first....

I don't know if this is an oversimplification or I've just got this plain wrong, but in my head, I had i.e. Read this doc on If not, what are the key differences in how the sidecar would behave?

At the point of authoring user-code which depends on dapr, I would generally only care about the overall health of the sidecar, which would include the health of all the components too when determining the overall health. If just one of the components failed its health check, I would expect the overall health check to fail.

Although I care about the health of the components, I'm unlikely to need to know exactly which components are unhealthy at this point in my code. I'm not saying that the breakdown of component health information isn't useful in other use-cases, but I'm just not sure I would utilise that information in my user-code.

var client = new DaprClientBuilder().Build();
var isDaprReady = await client.CheckHealthAsync();
if (isDaprReady)
{
    // Execute Dapr dependent code.
}
Building on the comment from @olitomlinson, this proposal needs clarity on what the goal is and how this works. In the update, it has lost the end-user viewpoint on this capability: how, why, and when to use this API. This is my understanding:
Signed-off-by: Deepanshu Agarwal <[email protected]>
0010-R-components-healthcheck.md
Outdated
## Use-Case
If a mandatory* component fails at the start-up, Dapr will terminate or will move to some non-workable state like CrashLoopBackoff etc., so `healthz` API or any other API can't be used.

After Dapr has started, if any Mandatory component fails, this healthcheck can be used to determine what component has failed and accordingly some steps can be undertaken.
After Dapr has started, if any Mandatory component fails, this healthcheck can be used to determine what component has failed and accordingly some steps can be undertaken.
During a period of time where the component health check is failing, what will be the effect on the sidecar's operation?
Will it be the same effect as with App Health Check, i.e. (taken from the App Healthcheck docs):
When it detects a failure in the app’s health, Dapr stops accepting new work on behalf of the application by:
- Unsubscribing from all pub/sub subscriptions
- Stopping all input bindings
- Short-circuiting all service-invocation requests, which terminate in the Dapr runtime and are not forwarded to the application
These changes are meant to be temporary, and Dapr resumes normal operations once it detects that the application is responsive again.
Just my initial take:
If a component health check has failed, the sidecar should go into the same state as with a failed App Health Check.
This helps to keep a consistent model, which then makes it easier to reason about the behaviour of the sidecar during the period of time where one or more probes/checks are failing.
In the case of App health, the Dapr sidecar needs to stop accepting events to process (which is why subscriptions, input bindings, and service invocation requests need to be stopped), because the App itself is not healthy and can't process these events, so the Dapr sidecar doesn't know what to do with them.
When a mandatory component is reported as unhealthy, some or all features of this component would already be unusable. So the App can use this piece of information to quickly report the status to either 1. some automated downstream system that fixes the issue, or 2. DevOps, who may consider some manual intervention.
But when a mandatory component is unhealthy, Dapr itself will not take any other action. Rather, the App can decide to stop sending/receiving events via this component until it comes back.
From my perspective, if the dapr sidecar knows that one or more mandatory components are not healthy, it makes little sense to allow the App to be invoked via PubSub/Service Invocation etc (I see this as inviting a preventable failure to occur)
Which is the same principle of App Health Checks -- if you know the App (or some downstream dependency) is not healthy, don't invoke the App.
I absolutely see that some of those components may not always be needed, depending on the code path that is taken, so to enact the same behaviour as a failing App Health Check may seem heavy handed / aggressive.
However, I prefer an aggressive health check strategy -- "the sidecar is only as healthy as its weakest link" :)
The positive side effect of all of this is it will encourage operators to ensure that Components are scoped accordingly to the Apps that depend on them, rather than having no scope and Components being applied to every App.
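The "only as healthy as its weakest link" strategy argued for above can be illustrated with a small aggregation sketch. The type `componentStatus` and the function `sidecarHealthy` are hypothetical names; the notion of a "mandatory" component is taken from the proposal's Use-Case section.

```go
package main

import "fmt"

// componentStatus is a hypothetical per-component health record.
type componentStatus struct {
	Name      string
	Mandatory bool
	Healthy   bool
}

// sidecarHealthy applies the "weakest link" rule: the sidecar is healthy
// only if every mandatory component is healthy. It also returns the names
// of the mandatory components that failed, in input order.
func sidecarHealthy(statuses []componentStatus) (bool, []string) {
	var failed []string
	for _, s := range statuses {
		if s.Mandatory && !s.Healthy {
			failed = append(failed, s.Name)
		}
	}
	return len(failed) == 0, failed
}

func main() {
	ok, failed := sidecarHealthy([]componentStatus{
		{Name: "statestore", Mandatory: true, Healthy: true},
		{Name: "pubsub", Mandatory: true, Healthy: false},
		{Name: "metrics-binding", Mandatory: false, Healthy: false},
	})
	fmt.Println("sidecar healthy:", ok, "failed mandatory components:", failed)
}
```

Whether an unhealthy result should also make the sidecar stop invoking the App, as with App Health Checks, is the open design question in this thread; the sketch only covers the aggregation step.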
Signed-off-by: Deepanshu Agarwal <[email protected]>
These updates help address the questions. I would still like to see this described from an end-user experience at the start. What would be written in the docs? Can we include this in the proposal? I agree with the observations from @olitomlinson that the sidecar needs to do more to prevent a known unhealthy status from affecting the application. This proposal places too many burdens on both the component implementer and the app, IMO, with too many settings. 1) Can Ping() have a default implementation that uses the component initialization for health status?
Thanks for the proposal! Wanted to give a couple of thoughts as well, but I don't want to rehash good points already raised too much. I agree with @msfussell above that Please make it clear in the proposal that the Rather than the component
type ComponentHealthz struct {
	Healthy bool `json:"healthy"`
	// StatusCode may or may not be appropriate for this Component to return.
	StatusCode *int `json:"status,omitempty"`
	// Message would be the human readable error message, or recovery message from unhealthy to healthy.
	Message *string `json:"message,omitempty"`
}
Can we not just do both? As I understand it, the project is trying to move toward unifying the capabilities of both protocols anyway, so this work is going to need to be done at some point. Rather than having another config option
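To show what the suggested `ComponentHealthz` shape would produce on the wire, here is a small sketch that serialises it with the standard `encoding/json` package, using `json` tags on every field. The helper `encode` and the sample values (503, "connection refused") are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ComponentHealthz mirrors the struct suggested in the comment above.
type ComponentHealthz struct {
	Healthy bool `json:"healthy"`
	// StatusCode may or may not be appropriate for this Component to return.
	StatusCode *int `json:"status,omitempty"`
	// Message would be the human readable error message, or recovery
	// message from unhealthy to healthy.
	Message *string `json:"message,omitempty"`
}

// encode is a hypothetical helper that returns the JSON form of h,
// or an empty string if marshalling fails.
func encode(h ComponentHealthz) string {
	b, err := json.Marshal(h)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	code := 503
	msg := "connection refused"

	// Nil pointer fields are dropped because of omitempty.
	fmt.Println(encode(ComponentHealthz{Healthy: true}))
	fmt.Println(encode(ComponentHealthz{Healthy: false, StatusCode: &code, Message: &msg}))
}
```

Pointer fields with `omitempty` let a healthy component return just `{"healthy":true}`, while an unhealthy one can add a status code and message when it has them.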
@DeepanshuA can you please put an updated version of the proposal into the issue description now, reconciling all feedback, so that it's easier to discuss? Reading the markdown isn't a great way to go about that.
API Design, Approaches and Recommended Approach for Components Healthcheck in Dapr.
Closes #35