Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

leaderelection: configure all timeouts via environment variables #305

Merged
merged 5 commits into from
Sep 5, 2023

Conversation

moio
Copy link
Contributor

@moio moio commented Jun 19, 2023

We have observed in fleet-agent user cases (SURE-6125/fleet#1465) that the leader election can be lost due to timeouts - not because pods are not working, but because the control plane is overwhelmed by load.

While of course having an overloaded control plane is a problem in and of itself, losing the election means that Wrangler-based pods will restart, which in turn means a new full listing of all interesting resources, creating additional load and exacerbating symptoms.

Ideally, the situation should improve when API Priority and Fairness goes out of beta and becomes commonplace - as it should ensure leader election API calls get served with higher priority, thus returning correctly within reasonable time frames even if the control plane is overloaded.

Meantime, pretty much the only other pragmatic option left is to give election more time before failure.

This PR allows to set all critical parameters via environment variables - with default values staying the same as before.

Note: this will also address (SURE-3805/#1491)

@moio moio requested a review from rmweir June 20, 2023 07:35
moio added a commit to rancher/fleet that referenced this pull request Jun 20, 2023
CATTLE_ELECTION_LEASE_DURATION
CATTLE_ELECTION_RENEW_DEADLINE
CATTLE_ELECTION_RETRY_PERIOD

See: rancher/wrangler#305

Signed-off-by: Silvio Moioli <[email protected]>
Copy link
Contributor

@rmweir rmweir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit, otherwise LGTM.

pkg/leader/leader.go Outdated Show resolved Hide resolved
@moio moio force-pushed the configurable_leader_election branch from fb22944 to 57d388e Compare June 22, 2023 08:48
@moio
Copy link
Contributor Author

moio commented Jun 22, 2023

@rmweir fixed. Do changes in wrangler require two approvers? If so, who could be the right person?

@rmweir rmweir requested a review from KevinJoiner June 26, 2023 16:32
Copy link
Contributor

@rmweir rmweir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rmweir
Copy link
Contributor

rmweir commented Jun 26, 2023

@moio yes, I assigned @KevinJoiner. Once approved it will be good to merge.

pkg/leader/leader.go Outdated Show resolved Hide resolved
@KevinJoiner
Copy link
Contributor

Also, I don't know what 🎯 policy is for Unit tests.

@moio
Copy link
Contributor Author

moio commented Jul 20, 2023

Well, the Bullseye Team does not have a policy at this point.

There is a guiding principle, which is to adapt to each Team's patterns/habits/processes/policies to the extent reasonable and possible - the goal being to help as good as possible.

Question becomes: would you normally add (or ask to add) tests in a case like this? If so, how would you propose to set them up?

Currently code is basically a "main" method and the effect of setting environment variables is only time-detectable - not really amenable to unit tests. One possibility would be to refactor out the main added block:

https://github.com/rancher/wrangler/pull/305/files#diff-2b8e7c2af1963d0a67b49eb0af1b1107583a195b328f40db0fc28970d7ba4dc9R54-R77

to a separate function returning the three values or err, then test it the various error conditions. To my personal sensibility, that would not be worth the cost/benefit tradeoff, but you as the long-term maintainers are the ones to set the bar for contributions, so I will respect whatever decision you take here.

@moio
Copy link
Contributor Author

moio commented Aug 1, 2023

@KevinJoiner do you have a comment on the last update?

@olblak
Copy link
Member

olblak commented Aug 2, 2023

@KevinJoiner , @rmweir What's the current state here?
By looking at the discussion history, it seems that this issue is stalled for a month now

@moio moio requested a review from KevinJoiner August 23, 2023 08:31
pkg/leader/leader.go Outdated Show resolved Hide resolved
@moio moio force-pushed the configurable_leader_election branch from 26372af to 7c35dd2 Compare August 28, 2023 09:40
@moio
Copy link
Contributor Author

moio commented Aug 28, 2023

@KevinJoiner: rebased, refactored per your suggestion, added unit tests. Please review again.

@moio moio requested a review from KevinJoiner August 28, 2023 09:41
pkg/leader/leader_test.go Show resolved Hide resolved
pkg/leader/leader.go Outdated Show resolved Hide resolved
@moio moio requested a review from KevinJoiner August 28, 2023 14:15
pkg/leader/leader_test.go Outdated Show resolved Hide resolved
Signed-off-by: Silvio Moioli <[email protected]>
@moio moio force-pushed the configurable_leader_election branch from eadda4d to ef1a21e Compare August 30, 2023 08:25
@moio moio requested a review from KevinJoiner August 30, 2023 08:26
@rmweir rmweir merged commit ea96805 into rancher:master Sep 5, 2023
1 check passed
@moio
Copy link
Contributor Author

moio commented Sep 29, 2023

@rmweir do we already know when a new version of wrangler could be tagged? I am asking because of a fleet issue that would benefit from this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants