track whether a Stage was ever healthy at any point following each promotion #2847

Open
wmiller112 opened this issue Oct 25, 2024 · 3 comments

Comments

@wmiller112
Contributor

Description

Currently, if a stage is responsible for syncing ArgoCD apps, those apps must be healthy before either:

  1. Analysis can begin for a promotion - if the stage has a configured analysis
  2. New promotion can begin - if the stage does not have configured analysis

This is problematic for projects with stages that handle applications that autoscale, as the applications frequently go into a progressing state as they scale. That progressing state is observed by the Kargo controller whether or not the app has already reported healthy for the given promotion, and it blocks new promotions. This effectively means the ability to deploy depends on an app's load. Breaking stages down to handle fewer applications is one option, but even a single app could have a deployment scaling from a few hundred to a few thousand pods throughout a day.

I wonder if it might make sense to track that a given app has reported healthy for a specific promotion, and then not allow it to go back to progressing? It could potentially just ignore any further progressing status, but still consider other statuses like error, unknown, etc., if the concern is that the app becomes unhealthy after the rollout completes.
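
To make the idea concrete, here is a minimal sketch in Go of what such a per-promotion "latch" could look like. Everything in it is an assumption for illustration: `HealthState`, `PromotionHealth`, and `Observe` are hypothetical names and do not correspond to existing Kargo or Argo CD types.

```go
package main

// HealthState is a hypothetical, simplified health enum used only in this sketch.
type HealthState string

const (
	Healthy     HealthState = "Healthy"
	Progressing HealthState = "Progressing"
	Unhealthy   HealthState = "Unhealthy"
	Unknown     HealthState = "Unknown"
)

// PromotionHealth (hypothetical) remembers whether the app has reported
// Healthy at least once since the given promotion.
type PromotionHealth struct {
	PromotionID string
	EverHealthy bool
}

// Observe records a raw health observation and returns the effective state
// that would be used when deciding whether a new promotion may begin.
func (p *PromotionHealth) Observe(raw HealthState) HealthState {
	switch {
	case raw == Healthy:
		// Latch: the app has been healthy at least once for this promotion.
		p.EverHealthy = true
		return Healthy
	case raw == Progressing && p.EverHealthy:
		// Progressing after the rollout completed (for example, autoscaling)
		// is ignored and treated as still healthy.
		return Healthy
	default:
		// Progressing before the first Healthy report, and every other
		// state (Unhealthy, Unknown, ...), passes through unchanged.
		return raw
	}
}
```

With this, a sequence of `Progressing`, `Healthy`, `Progressing` for one promotion would yield `Progressing`, `Healthy`, `Healthy`, while a later `Unhealthy` would still surface as `Unhealthy`.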

@krancour
Member

krancour commented Oct 25, 2024

The determination that the next Promotion may proceed or not looks at verification status and, absent any verification process, looks at Stage health. Stage health is partially informed by Application state(s), but the decision to proceed to the next promotion or not is entirely unaware of what's going on with Argo CD.
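
As a rough sketch of that rule (not the actual controller code), the decision can be written as a function of three booleans. The `stageGate` type and its field names are hypothetical and exist only for this example.

```go
package main

// stageGate is a hypothetical summary of what the decision looks at.
type stageGate struct {
	VerificationConfigured bool // the Stage defines a verification process
	VerificationPassed     bool // that verification has completed successfully
	StageHealthy           bool // current Stage health, partly informed by Application state
}

// nextPromotionAllowed mirrors the rule described above: verification status
// decides when verification is configured; otherwise current Stage health does.
func nextPromotionAllowed(g stageGate) bool {
	if g.VerificationConfigured {
		return g.VerificationPassed
	}
	return g.StageHealthy
}
```

Note that in the second branch only the current health is consulted, which is why a Stage that is momentarily not healthy blocks the next Promotion even if it was healthy a minute earlier.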

i.e. This issue becomes simpler (only in relative terms) if we leave Argo CD out of the equation. (If it seems like I'm being pedantic about this, it's only because we're trying very hard to decouple most of Kargo from Argo CD. In an ideal world, the promotion step that updates Argo CD Apps would be the only component of Kargo with any Argo CD awareness.)

What you are proposing is something I had, in fact, considered at one point. My approach had simply been that in the absence of any user-defined verification process that would leave behind a VerificationInfo, we'd leave one behind the first time the Stage reached a healthy state following a Promotion. In this way, the question of whether to proceed with the next Promotion or not would be simplified to just, "Do we have any verification results yet?" I ended up discarding that approach because I thought it too likely that users who didn't define any verification process would feel confused seeing verification results anyway. Now I'm back to wondering if that wouldn't be so bad.

I believe @hiddeco is actively working on refactoring this bit of code, which I've mentioned is quite complex and difficult to reason over. I'd like him to weigh in on how this may be best resolved.

To be clear, I believe we do need to better account for the scenario you described. I am just not positive what that looks like yet.

@krancour krancour changed the title Record ArgoCD App Health for Promotion track whether a Stage was ever healthy as some point following each promotion Oct 25, 2024
@krancour krancour changed the title track whether a Stage was ever healthy as some point following each promotion track whether a Stage was ever healthy at any point following each promotion Oct 25, 2024
@wmiller112
Contributor Author

That all makes sense, thank you for the breakdown. I could definitely see that being confusing, but I guess the flip side is the exact opposite, which is what prompted my comment here - basically "We're not waiting on any verification, why isn't this promotion running?". I'll follow along as that work progresses, and I'm happy to test any proposed solutions as they come together.

@hiddeco hiddeco self-assigned this Oct 26, 2024
@hiddeco
Contributor

hiddeco commented Oct 26, 2024

Next week I will be in a better position to form a decent opinion on what options are worth looking into, but what I can already say is that verification and health checks do need to change in some way to make things more pleasant in multiple scenarios.
