Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add component healthcheck api design #34

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

DeepanshuA
Copy link

@DeepanshuA DeepanshuA commented Jul 20, 2023

API Design, Approaches and Recommended Approach for Components Healthcheck in Dapr.

Closes #35

Signed-off-by: Deepanshu Agarwal <[email protected]>
2. It should not give any false positives or false negatives.

## Current Scenario
There are many components in Dapr which don't yet implement Ping.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from Alessandro (@ItalyPaleAle ): I don't know if making Ping mandatory is needed. A lot of components are stateless (for example, they don't maintain persistent connections with a remove service). IMHO it's fine to include Ping in an optional interface.

Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is Optional only right now. And, that is the correct state in my opinion too. The doc also doesn't recommend it to make mandatory.

2. It should not give any false positives or false negatives.

## Current Scenario
There are many components in Dapr which don't yet implement Ping.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from Alessandro (@ItalyPaleAle ): That said, we should review components that don't implement Ping, and see if adding it would be useful.

Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed



## Open Questions
1. Is it only for kubernetes users? Is it only needed for http endpoint? Or we should cover gRPC endpoint as well?
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from Alessandro (@ItalyPaleAle ): No, although K8s is arguably the main user of this.

Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from Alessandro (@ItalyPaleAle ): Which brings up a question: should we add an annotation to indicate which components to include in the K8s healthchecks?

Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg

Copy link
Member

@yaron2 yaron2 Jul 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure components should participate at all in K8s health checks. In k8s (and other systems) the health check removes a service from the endpoint collection for a K8s service, which means if a component is even temporarily down this might cause service invocation / actor interactions to stop working which could cause unplanned downtime and outages. Dapr health should be separated from component health.



## Open Questions
1. Is it only for kubernetes users? Is it only needed for http endpoint? Or we should cover gRPC endpoint as well?
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from Alessandro (@ItalyPaleAle ): We don't have a gRPC healthcheck endpoint today. Not sure if that's needed?

Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg

]
}
```
**Case 3:** When SOME components in system implement Ping AND some components DON'T implement Ping, AND some components have failed Ping check as well: Here we report failed components and those components as well for which Ping is not implemented, we don't treat this API as a way to list components.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from Alessandro (@ItalyPaleAle ): I am not sure we need a case for this. As per my other comment, IMHO it's totally fine for a component to NOT have Ping implemented. I think the healthcheck should invoke Ping when available, and otherwise just skip the component

Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, won't it give False Positives then? How as a user am I supposed to know that what all components implement Ping and that only those components are covered by this Health Check?

Signed-off-by: Deepanshu Agarwal <[email protected]>
Signed-off-by: Deepanshu Agarwal <[email protected]>
Signed-off-by: Deepanshu Agarwal <[email protected]>
Signed-off-by: Deepanshu Agarwal <[email protected]>
Signed-off-by: Deepanshu Agarwal <[email protected]>
Signed-off-by: Deepanshu Agarwal <[email protected]>
0010-R-components-healthcheck.md Outdated Show resolved Hide resolved
### Approach:
- Maintain a cache with status of all components loaded successfully and keep updating this cache in a background go routine at a configurable `pingUpdateFrequency`. By default, `pingUpdateFrequency` to be 5 minutes.

- This cache will not start to be built, right at the boot of daprd sidecar. There will be flag (let's say `collectPings`), which will be `false` at the beginning of the daprd sidecar and which will be turned `true`, once all the components are ready.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are these options passed? Are they in the Configuration CRD?

Copy link
Author

@DeepanshuA DeepanshuA Aug 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collectPings is an Internal flag.
For pingUpdateFrequency, yeah Configuration CRD would be the place.

http://localhost:3500/v1.0/healthz?include_components=true

### Approach:
- Maintain a cache with status of all components loaded successfully and keep updating this cache in a background go routine at a configurable `pingUpdateFrequency`. By default, `pingUpdateFrequency` to be 5 minutes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts:

  • 5 mins seems a lot? That would mean the time to detect a failure can be as high as 5 mins. Maybe that's just my opinion.
  • Ping updates may be more frequent if the component is un-healthy, since we want to see if we can recover faster

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I agree that 5 minutes seems higher. I have updated that to 30 seconds.
  • I like this point but do we need another configuration for this? One more config will be too much, so rather gave logic that in case of unhealthy, it will Ping every pingUpdateFrequency / 3.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If un-healthy we should use an exponential backoff (with a limit of pingUpdateFrequency) IMHO, to avoid overloading a service.

@olitomlinson
Copy link

olitomlinson commented Jul 27, 2023

Before I put too much thought into the API design, I want to get the purpose straight in my head first....

I don't know if this is an over simplification or I've just got this plain wrong, but in my head, I had component health checks providing the same outcome as app health checks?

i.e. Read this doc on App Health checks, and in your mind, substitute the word app / application for component -- Is that what the component health checks are trying to achieve ?

If not, what are the key differences in how the sidecar would behave?


At the point of authoring user-code which depends on dapr, I would generally only care about the overall health of the sidecar, which would include the health of all the components too when determining the overall health.

If just one of the components failed their health check, I would expect the the overall health check to fail.

Although I care about the health of the components, I'm unlikely to need to know exactly which components are unhealthy at this point in my code. I'm not saying that the breakdown of component health information isn't useful in other use-cases, but I'm just not sure I would utilise that information in my user-code.

var client = new DaprClientBuilder().Build();

var isDaprReady = await client.CheckHealthAsync();

if (isDaprReady) 
{
    // Execute Dapr dependent code.
}

@msfussell
Copy link
Member

msfussell commented Jul 31, 2023

Building on the comment from @olitomlinson this proposal need clarity on what the goal is and how this works. In the update it has lost the end user viewpoint to this capability. How, why and when to use this API.

This is my understanding:

  1. A ping method is present on the component API today and components can implement this by a component author. (This is an unfortunate name since the ping method is really about component health as the package describes). IMO this should ping method should be implemented by default on every component to reflect its initialization status (success or failure). Thereby not putting the burden on future authors and reworking existing components and also prevents the ambiguity of a component not having any health status

  2. This proposal extends the Dapr side healthz endpoint https://docs.dapr.io/developing-applications/building-blocks/observability/sidecar-health/ to cascade health (aggregate status) across each of the components for the sidecar. Hence the code example above ^^ from Oli should just work.
    Question: In this proposal are you expecting to call http://localhost:3500/v1.0/healthz?include_components=true? If so now we are going to have to update each SDK to have this additional parameter on CheckHealthAsync(); and the then preferred call would then usually be CheckHealthAsync(true);or similar to include component health. Is there another way instead of updating the SDK, that you could enable component health checking?

  3. It was not clear to me from this proposal how component health was measured. However I presume that a single component reporting failed health made the health of the sidecar failed (it is cascading). This is why it is important that ping is implemented by default IMO, for component initialization status.

  4. Given that the /healthz endpoint is unsecure, this proposal hints at having the metadata endpoint return additional information about failed components, but does not say that explicitly. Is that the intention? If so, what does the structure of this information look like for each healthy and unhealthy component?

  5. I do not feel that we should be adding more information to the metadata endpoint without being clear on its usage. It is becoming a place to return all information, growing in size and most of this not useful for any given request. The metadata endpoint needs to be thought through on how it can be parameterized to return information say on "component health" or the "status of the specs like resiliency policies" or "actor types registered" etc and then it becomes more useful and explicit. This can be a separate proposal.

Signed-off-by: Deepanshu Agarwal <[email protected]>
Signed-off-by: Deepanshu Agarwal <[email protected]>
## Use-Case
If a mandatory* component fails at the start-up, Dapr will terminate or will move to some non-workable state like CrashLoopBackoff etc., so `healthz` API or any other API can't be used.

After Dapr has started, if any Mandatory component fails, this healthcheck can be used to determine what component has failed and accordingly some steps acam be undertaken.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After Dapr has started, if any Mandatory component fails, this healthcheck can be used to determine what component has failed and accordingly some steps can be undertaken.

During a period of time where the component health check is failing, what will be the effect on the sidecars operation?

Will it be the same effect as with App Health Check i.e.

Taken from the App Healthcheck Docs

When it detects a failure in the app’s health, Dapr stops accepting new work on behalf of the application by:

  • Unsubscribing from all pub/sub subscriptions
  • Stopping all input bindings
  • Short-circuiting all service-invocation requests, which terminate in the Dapr runtime and are not forwarded to the application

These changes are meant to be temporary, and Dapr resumes normal operations once it detects that the application is responsive again.

Copy link

@olitomlinson olitomlinson Aug 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just my initial take

If a component health check is failed, the sidecar should go into the same state as with a failed App Health Check.

This helps to keeps a consistent model, which then makes it easier to reason about the behaviour of the sidecar during the period of time where one or more various probes/checks are failing.

Copy link
Author

@DeepanshuA DeepanshuA Aug 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of App health, Dapr sidecar need to stop accepting events to process ( i.e. why subscriptions, input bindings or service invocation requests need to be stopped), as App itself is not Healthy and can't process these events. So, Dapr sidecar doesn't know what to do with these events.

In case of a mandatory component health being reported as unhealthy, some features or All features of this component would be already un-usable. So, App can use this piece of information in a quick way to report back this status to 1. either some automated downstream to fix this issue or 2. Devops, which may consider some manual intervention.

But, here in case of a mandatory component being unhealthy, Dapr itself will not do any other operation. Here, rather App can decide to stop sending/receiving events via this component until it comes back.

Copy link

@olitomlinson olitomlinson Aug 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective, if the dapr sidecar knows that one or more mandatory components are not healthy, it makes little sense to allow the App to be invoked via PubSub/Service Invocation etc (I see this as inviting a preventable failure to occur)

Which is the same principle of App Health Checks -- if you know the App (or some downstream dependency) is not healthy, don't invoke the App.

I absolutely see that some of those components may not always be needed, depending on the code path that is taken, so to enact the same behaviour as a failing App Health Check may seem heavy handed / aggressive.

However, I prefer an aggressive health check strategy -- "the sidecar is only as healthy as its weakest link" :)


The positive side effect of all of this is it will encourage operators to ensure that Components are scoped accordingly to the Apps that depend on them, rather than having no scope and Components being applied to every App.

Signed-off-by: Deepanshu Agarwal <[email protected]>
Signed-off-by: Deepanshu Agarwal <[email protected]>
@DeepanshuA
Copy link
Author

DeepanshuA commented Aug 1, 2023

Building on the comment from @olitomlinson this proposal need clarity on what the goal is and how this works. In the update it has lost the end user viewpoint to this capability. How, why and when to use this API.

  1. Yeah, SDKs would have to include an option to pass this query param. In case of a new endpoint, this would have to be supported in SDKs from scratch.
  2. Yeah, it is cascading. Added a diagram to explain internal working - hope it helps.
  3. Added details about additional info in metadata and diagram to explain that.
  4. Added a query param for metadata consisting component health check details.

@msfussell
Copy link
Member

msfussell commented Aug 1, 2023

These update help address questions. I would still like to see this described from an end user experience at the start. What would be written in the docs? Can we include this into the proposal

I agree with the observations from @olitomlinson that the sidecar needs to do more to prevent a know unhealthy status affecting the application. This proposal places to many burdens on both the component implementer and on the app IMO with too many settings.

1)Can pIng() have a default implementation that uses the component initialization for health status?
2) Why the need for mandatory verses optional component health? This is yet another thing to configure. All components should have a default ping() and then this need goes away. Even if kept setting this via spec.ignoreErrors to True is obscure, and we should not overload one setting to mean something else.
3) IMO The health of components should be set the default state on healthz endpoint. In other words you should exclude components with a flag not include components with a flag. I see the component health being that important to the overall health of the sidecar.
4) Re @olitomlinson's comment on whether Dapr sidecar can do more here based on the component health to not send messages etc to the app. I agree here and we should discuss this more
5) Are OK and NOT_OK the best status codes for health? Should these not indicate the health status?

@JoshVanL
Copy link
Contributor

JoshVanL commented Aug 2, 2023

Thanks for the proposal! Wanted to give a couple thoughts as well but don't want to rehash good points already raised to much-

I agree with @msfussell above that Ping(context.Context) error should be a mandatory function on every component interface, even if that means it is a no-op always return nil for some types. Status should always be included in the response if we move forward with this API. This ensures feature devs don't miss the implementation. As an aside, I think this should also be true for interfaces to implement io.Closer for similar reasons.

Please make it clear in the proposal that the healthz endpoint is an unprivileged endpoint, and as such, should query the privileged metadata endpoint for richer component status information.

Rather than the component status and errorMessage fields being flat to the component struct, I think they should be a child struct. This is important for keeping the API extensible and easier to parse for clients/humans. We also probably want to return the health status by default.

type ComponentHealthz struct {
  Healthy bool `json:"healthy"`
  // StatusCode may or may not be appropriate for this Component to return.
  StatusCode *int `json:"status,omitempty"`
  // Message would be the human readable error message, or recovery message from unhealthy to healthy.
  Message *string `josn:"message,omitempty"`
}
  • To implement only http endpoint, at least for the first version of this API.

Can we not just do both? As I understand, the project is trying to move toward unifying the capabilities of both protocols anyway so this work is going to need to be done at some point.

Rather than having another config option pingUpdateFrequency, can each component not signal the appropriate infrequency of polling for its particular type? This would alleviate needing both a new config option, and the set frequency being inappropriate for some component types and not others.

@berndverst
Copy link
Member

@DeepanshuA can you please put an updated version of the proposal into the issue description now - reconciling all feedback, so that it's easier to discuss? Reading the markdown isn't the great way to go about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add API Design for Components Healthcheck
7 participants