Add component healthcheck api design #34

DeepanshuA · 2023-07-20T08:30:07Z

API Design, Approaches and Recommended Approach for Components Healthcheck in Dapr.

Closes #35

Signed-off-by: Deepanshu Agarwal <[email protected]>

DeepanshuA · 2023-07-20T09:05:50Z

0010-R-components-healthcheck.md

+2. It should not give any false positives or false negatives.
+
+## Current Scenario
+There are many components in Dapr which don't yet implement Ping. 


Comment from Alessandro (@ItalyPaleAle ): I don't know if making Ping mandatory is needed. A lot of components are stateless (for example, they don't maintain persistent connections with a remove service). IMHO it's fine to include Ping in an optional interface.

Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg

Yeah, it is Optional only right now. And, that is the correct state in my opinion too. The doc also doesn't recommend it to make mandatory.

DeepanshuA · 2023-07-20T09:06:44Z

0010-R-components-healthcheck.md

+2. It should not give any false positives or false negatives.
+
+## Current Scenario
+There are many components in Dapr which don't yet implement Ping. 


Comment from Alessandro (@ItalyPaleAle ): That said, we should review components that don't implement Ping, and see if adding it would be useful.

Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg

DeepanshuA · 2023-07-20T09:09:28Z

0010-R-components-healthcheck.md

+
+
+## Open Questions
+1. Is it only for kubernetes users? Is it only needed for http endpoint? Or we should cover gRPC endpoint as well?


Comment from Alessandro (@ItalyPaleAle ): No, although K8s is arguably the main user of this.

Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg

Comment from Alessandro (@ItalyPaleAle ): Which brings up a question: should we add an annotation to indicate which components to include in the K8s healthchecks?

Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg

I'm not sure components should participate at all in K8s health checks. In k8s (and other systems) the health check removes a service from the endpoint collection for a K8s service, which means if a component is even temporarily down this might cause service invocation / actor interactions to stop working which could cause unplanned downtime and outages. Dapr health should be separated from component health.

DeepanshuA · 2023-07-20T09:16:07Z

0010-R-components-healthcheck.md

+
+
+## Open Questions
+1. Is it only for kubernetes users? Is it only needed for http endpoint? Or we should cover gRPC endpoint as well?


Comment from Alessandro (@ItalyPaleAle ): We don't have a gRPC healthcheck endpoint today. Not sure if that's needed?

Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg

DeepanshuA · 2023-07-20T09:17:26Z

0010-R-components-healthcheck.md

+        ]
+}
+```
+**Case 3:** When SOME components in system implement Ping AND some components DON'T implement Ping, AND some components have failed Ping check as well: Here we report failed components and those components as well for which Ping is not implemented, we don't treat this API as a way to list components.


Comment from Alessandro (@ItalyPaleAle ): I am not sure we need a case for this. As per my other comment, IMHO it's totally fine for a component to NOT have Ping implemented. I think the healthcheck should invoke Ping when available, and otherwise just skip the component

Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg

But, won't it give False Positives then? How as a user am I supposed to know that what all components implement Ping and that only those components are covered by this Health Check?

Signed-off-by: Deepanshu Agarwal <[email protected]>

0010-R-components-healthcheck.md

ItalyPaleAle · 2023-07-26T14:51:21Z

0010-R-components-healthcheck.md

+### Approach: 
+- Maintain a cache with status of all components loaded successfully and keep updating this cache in a background go routine at a configurable `pingUpdateFrequency`. By default, `pingUpdateFrequency` to be 5 minutes.
+
+- This cache will not start to be built, right at the boot of daprd sidecar. There will be flag (let's say `collectPings`), which will be `false` at the beginning of the daprd sidecar and which will be turned `true`, once all the components are ready.


How are these options passed? Are they in the Configuration CRD?

collectPings is an Internal flag.
For pingUpdateFrequency, yeah Configuration CRD would be the place.

ItalyPaleAle · 2023-07-26T14:52:13Z

0010-R-components-healthcheck.md

+http://localhost:3500/v1.0/healthz?include_components=true
+
+### Approach: 
+- Maintain a cache with status of all components loaded successfully and keep updating this cache in a background go routine at a configurable `pingUpdateFrequency`. By default, `pingUpdateFrequency` to be 5 minutes.


Some thoughts:

5 mins seems a lot? That would mean the time to detect a failure can be as high as 5 mins. Maybe that's just my opinion.

Ping updates may be more frequent if the component is un-healthy, since we want to see if we can recover faster

I agree that 5 minutes seems higher. I have updated that to 30 seconds.

I like this point but do we need another configuration for this? One more config will be too much, so rather gave logic that in case of unhealthy, it will Ping every pingUpdateFrequency / 3.

If un-healthy we should use an exponential backoff (with a limit of pingUpdateFrequency) IMHO, to avoid overloading a service.

olitomlinson · 2023-07-27T09:29:12Z

Before I put too much thought into the API design, I want to get the purpose straight in my head first....

I don't know if this is an over simplification or I've just got this plain wrong, but in my head, I had component health checks providing the same outcome as app health checks?

i.e. Read this doc on App Health checks, and in your mind, substitute the word app / application for component -- Is that what the component health checks are trying to achieve ?

If not, what are the key differences in how the sidecar would behave?

At the point of authoring user-code which depends on dapr, I would generally only care about the overall health of the sidecar, which would include the health of all the components too when determining the overall health.

If just one of the components failed their health check, I would expect the the overall health check to fail.

Although I care about the health of the components, I'm unlikely to need to know exactly which components are unhealthy at this point in my code. I'm not saying that the breakdown of component health information isn't useful in other use-cases, but I'm just not sure I would utilise that information in my user-code.

var client = new DaprClientBuilder().Build();

var isDaprReady = await client.CheckHealthAsync();

if (isDaprReady) 
{
    // Execute Dapr dependent code.
}

msfussell · 2023-07-31T23:37:48Z

Building on the comment from @olitomlinson this proposal need clarity on what the goal is and how this works. In the update it has lost the end user viewpoint to this capability. How, why and when to use this API.

This is my understanding:

A ping method is present on the component API today and components can implement this by a component author. (This is an unfortunate name since the ping method is really about component health as the package describes). IMO this should ping method should be implemented by default on every component to reflect its initialization status (success or failure). Thereby not putting the burden on future authors and reworking existing components and also prevents the ambiguity of a component not having any health status
This proposal extends the Dapr side healthz endpoint https://docs.dapr.io/developing-applications/building-blocks/observability/sidecar-health/ to cascade health (aggregate status) across each of the components for the sidecar. Hence the code example above ^^ from Oli should just work.
Question: In this proposal are you expecting to call http://localhost:3500/v1.0/healthz?include_components=true? If so now we are going to have to update each SDK to have this additional parameter on CheckHealthAsync(); and the then preferred call would then usually be CheckHealthAsync(true);or similar to include component health. Is there another way instead of updating the SDK, that you could enable component health checking?
It was not clear to me from this proposal how component health was measured. However I presume that a single component reporting failed health made the health of the sidecar failed (it is cascading). This is why it is important that ping is implemented by default IMO, for component initialization status.
Given that the /healthz endpoint is unsecure, this proposal hints at having the metadata endpoint return additional information about failed components, but does not say that explicitly. Is that the intention? If so, what does the structure of this information look like for each healthy and unhealthy component?
I do not feel that we should be adding more information to the metadata endpoint without being clear on its usage. It is becoming a place to return all information, growing in size and most of this not useful for any given request. The metadata endpoint needs to be thought through on how it can be parameterized to return information say on "component health" or the "status of the specs like resiliency policies" or "actor types registered" etc and then it becomes more useful and explicit. This can be a separate proposal.

Signed-off-by: Deepanshu Agarwal <[email protected]>

olitomlinson · 2023-08-01T10:15:58Z

0010-R-components-healthcheck.md

+## Use-Case
+If a mandatory* component fails at the start-up, Dapr will terminate or will move to some non-workable state like CrashLoopBackoff etc., so `healthz` API or any other API can't be used.
+
+After Dapr has started, if any Mandatory component fails, this healthcheck can be used to determine what component has failed and accordingly some steps acam be undertaken.


After Dapr has started, if any Mandatory component fails, this healthcheck can be used to determine what component has failed and accordingly some steps can be undertaken.

During a period of time where the component health check is failing, what will be the effect on the sidecars operation?

Will it be the same effect as with App Health Check i.e.

Taken from the App Healthcheck Docs

When it detects a failure in the app’s health, Dapr stops accepting new work on behalf of the application by:

Unsubscribing from all pub/sub subscriptions

Stopping all input bindings

Short-circuiting all service-invocation requests, which terminate in the Dapr runtime and are not forwarded to the application

These changes are meant to be temporary, and Dapr resumes normal operations once it detects that the application is responsive again.

Just my initial take

If a component health check is failed, the sidecar should go into the same state as with a failed App Health Check.

This helps to keeps a consistent model, which then makes it easier to reason about the behaviour of the sidecar during the period of time where one or more various probes/checks are failing.

In case of App health, Dapr sidecar need to stop accepting events to process ( i.e. why subscriptions, input bindings or service invocation requests need to be stopped), as App itself is not Healthy and can't process these events. So, Dapr sidecar doesn't know what to do with these events.

In case of a mandatory component health being reported as unhealthy, some features or All features of this component would be already un-usable. So, App can use this piece of information in a quick way to report back this status to 1. either some automated downstream to fix this issue or 2. Devops, which may consider some manual intervention.

But, here in case of a mandatory component being unhealthy, Dapr itself will not do any other operation. Here, rather App can decide to stop sending/receiving events via this component until it comes back.

From my perspective, if the dapr sidecar knows that one or more mandatory components are not healthy, it makes little sense to allow the App to be invoked via PubSub/Service Invocation etc (I see this as inviting a preventable failure to occur)

Which is the same principle of App Health Checks -- if you know the App (or some downstream dependency) is not healthy, don't invoke the App.

I absolutely see that some of those components may not always be needed, depending on the code path that is taken, so to enact the same behaviour as a failing App Health Check may seem heavy handed / aggressive.

However, I prefer an aggressive health check strategy -- "the sidecar is only as healthy as its weakest link" :)

The positive side effect of all of this is it will encourage operators to ensure that Components are scoped accordingly to the Apps that depend on them, rather than having no scope and Components being applied to every App.

Signed-off-by: Deepanshu Agarwal <[email protected]>

DeepanshuA · 2023-08-01T12:26:59Z

Building on the comment from @olitomlinson this proposal need clarity on what the goal is and how this works. In the update it has lost the end user viewpoint to this capability. How, why and when to use this API.

Yeah, SDKs would have to include an option to pass this query param. In case of a new endpoint, this would have to be supported in SDKs from scratch.
Yeah, it is cascading. Added a diagram to explain internal working - hope it helps.
Added details about additional info in metadata and diagram to explain that.
Added a query param for metadata consisting component health check details.

msfussell · 2023-08-01T16:03:40Z

These update help address questions. I would still like to see this described from an end user experience at the start. What would be written in the docs? Can we include this into the proposal

I agree with the observations from @olitomlinson that the sidecar needs to do more to prevent a know unhealthy status affecting the application. This proposal places to many burdens on both the component implementer and on the app IMO with too many settings.

1)Can pIng() have a default implementation that uses the component initialization for health status?
2) Why the need for mandatory verses optional component health? This is yet another thing to configure. All components should have a default ping() and then this need goes away. Even if kept setting this via spec.ignoreErrors to True is obscure, and we should not overload one setting to mean something else.
3) IMO The health of components should be set the default state on healthz endpoint. In other words you should exclude components with a flag not include components with a flag. I see the component health being that important to the overall health of the sidecar.
4) Re @olitomlinson's comment on whether Dapr sidecar can do more here based on the component health to not send messages etc to the app. I agree here and we should discuss this more
5) Are OK and NOT_OK the best status codes for health? Should these not indicate the health status?

JoshVanL · 2023-08-02T12:21:42Z

Thanks for the proposal! Wanted to give a couple thoughts as well but don't want to rehash good points already raised to much-

I agree with @msfussell above that Ping(context.Context) error should be a mandatory function on every component interface, even if that means it is a no-op always return nil for some types. Status should always be included in the response if we move forward with this API. This ensures feature devs don't miss the implementation. As an aside, I think this should also be true for interfaces to implement io.Closer for similar reasons.

Please make it clear in the proposal that the healthz endpoint is an unprivileged endpoint, and as such, should query the privileged metadata endpoint for richer component status information.

Rather than the component status and errorMessage fields being flat to the component struct, I think they should be a child struct. This is important for keeping the API extensible and easier to parse for clients/humans. We also probably want to return the health status by default.

type ComponentHealthz struct {
  Healthy bool `json:"healthy"`
  // StatusCode may or may not be appropriate for this Component to return.
  StatusCode *int `json:"status,omitempty"`
  // Message would be the human readable error message, or recovery message from unhealthy to healthy.
  Message *string `josn:"message,omitempty"`
}

To implement only http endpoint, at least for the first version of this API.

Can we not just do both? As I understand, the project is trying to move toward unifying the capabilities of both protocols anyway so this work is going to need to be done at some point.

Rather than having another config option pingUpdateFrequency, can each component not signal the appropriate infrequency of polling for its particular type? This would alleviate needing both a new config option, and the set frequency being inappropriate for some component types and not others.

berndverst · 2023-10-03T20:22:35Z

@DeepanshuA can you please put an updated version of the proposal into the issue description now - reconciling all feedback, so that it's easier to discuss? Reading the markdown isn't the great way to go about that.

Add component healthcheck api design

cd99db6

Signed-off-by: Deepanshu Agarwal <[email protected]>

DeepanshuA commented Jul 20, 2023

View reviewed changes

DeepanshuA added 6 commits July 20, 2023 20:56

Revision 2

14e30a2

Signed-off-by: Deepanshu Agarwal <[email protected]>

Revision 3

9ee3a92

Signed-off-by: Deepanshu Agarwal <[email protected]>

Revision 4

22d9b7e

Signed-off-by: Deepanshu Agarwal <[email protected]>

Revision 5

5a32ea2

Signed-off-by: Deepanshu Agarwal <[email protected]>

Incorporate Thursday Forum Comments

82a5060

Signed-off-by: Deepanshu Agarwal <[email protected]>

Spelling

f92c475

Signed-off-by: Deepanshu Agarwal <[email protected]>

ItalyPaleAle reviewed Jul 26, 2023

View reviewed changes

msfussell mentioned this pull request Jul 14, 2023

Allow components to perform custom health checks dapr/docs#2406

Open

DeepanshuA added 2 commits August 1, 2023 15:05

Update usecase

ea97fc8

Signed-off-by: Deepanshu Agarwal <[email protected]>

Update usecase

7a84066

Signed-off-by: Deepanshu Agarwal <[email protected]>

olitomlinson reviewed Aug 1, 2023

View reviewed changes

DeepanshuA added 2 commits August 1, 2023 15:57

Update Internal Working diagram

fe5e04b

Signed-off-by: Deepanshu Agarwal <[email protected]>

Update metadata API details

356976a

Signed-off-by: Deepanshu Agarwal <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add component healthcheck api design #34

Add component healthcheck api design #34

DeepanshuA commented Jul 20, 2023 •

edited

Loading

DeepanshuA Jul 20, 2023

DeepanshuA Jul 20, 2023

DeepanshuA Jul 20, 2023

DeepanshuA Jul 20, 2023

DeepanshuA Jul 20, 2023

DeepanshuA Jul 20, 2023

yaron2 Jul 26, 2023 •

edited

Loading

DeepanshuA Jul 20, 2023

DeepanshuA Jul 20, 2023

DeepanshuA Jul 20, 2023

ItalyPaleAle Jul 26, 2023

DeepanshuA Aug 1, 2023 •

edited

Loading

ItalyPaleAle Jul 26, 2023

DeepanshuA Aug 1, 2023

ItalyPaleAle Aug 1, 2023

olitomlinson commented Jul 27, 2023 •

edited

Loading

msfussell commented Jul 31, 2023 •

edited

Loading

olitomlinson Aug 1, 2023

olitomlinson Aug 1, 2023 •

edited

Loading

DeepanshuA Aug 1, 2023 •

edited

Loading

olitomlinson Aug 1, 2023 •

edited

Loading

DeepanshuA commented Aug 1, 2023 •

edited

Loading

msfussell commented Aug 1, 2023 •

edited

Loading

JoshVanL commented Aug 2, 2023

berndverst commented Oct 3, 2023



		## Open Questions
		1. Is it only for kubernetes users? Is it only needed for http endpoint? Or we should cover gRPC endpoint as well?

Add component healthcheck api design #34

Are you sure you want to change the base?

Add component healthcheck api design #34

Conversation

DeepanshuA commented Jul 20, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yaron2 Jul 26, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DeepanshuA Aug 1, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

olitomlinson commented Jul 27, 2023 • edited Loading

msfussell commented Jul 31, 2023 • edited Loading

Choose a reason for hiding this comment

olitomlinson Aug 1, 2023 • edited Loading

Choose a reason for hiding this comment

DeepanshuA Aug 1, 2023 • edited Loading

Choose a reason for hiding this comment

olitomlinson Aug 1, 2023 • edited Loading

Choose a reason for hiding this comment

DeepanshuA commented Aug 1, 2023 • edited Loading

msfussell commented Aug 1, 2023 • edited Loading

JoshVanL commented Aug 2, 2023

berndverst commented Oct 3, 2023

DeepanshuA commented Jul 20, 2023 •

edited

Loading

yaron2 Jul 26, 2023 •

edited

Loading

DeepanshuA Aug 1, 2023 •

edited

Loading

olitomlinson commented Jul 27, 2023 •

edited

Loading

msfussell commented Jul 31, 2023 •

edited

Loading

olitomlinson Aug 1, 2023 •

edited

Loading

DeepanshuA Aug 1, 2023 •

edited

Loading

olitomlinson Aug 1, 2023 •

edited

Loading

DeepanshuA commented Aug 1, 2023 •

edited

Loading

msfussell commented Aug 1, 2023 •

edited

Loading