host-ctr: Implement exponential backoff for image pulls #433
containerd.WithSchema1Conversion)
if err != nil {
return nil, errors.Wrap(err, "Failed to pull ctr image")
// Retry with exponential backoff when failures occur, with 10 attempts
Most folks would expect to see a fairly common implementation reused such as https://github.com/cenkalti/backoff, but this looks small enough to me to stand alone perfectly well in this use case.
🎃
Force-pushed from 62265a1 to 61c3cc9.
Addresses @jahkeup's comments.
I think we might want to handle the contexts throughout main to respond to signals thoroughly.
Force-pushed from 61c3cc9 to 24633c1.
Addresses @jahkeup's comment regarding signal channel.
Force-pushed from 24633c1 to 6d9636e.
Updates maximum retry interval from 32 seconds to 30 seconds.
Force-pushed from 6d9636e to 4f7e5cf.
Sorry, I misunderstood the comment about the duration. Updated to guarantee the maximum retry duration won't exceed 31 seconds (1+2+4+8+16).
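To make the arithmetic above concrete, here is a minimal sketch (not the code from this PR) of a doubling retry schedule capped by a total wait budget. The function name `backoffSchedule` and its parameters are hypothetical, chosen only for illustration; with a 1-second initial interval and a 31-second budget it yields the 1+2+4+8+16 sequence mentioned in the comment.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule (hypothetical helper) returns the wait before each retry.
// The interval starts at initial and doubles every time; the schedule stops
// before the cumulative wait would exceed maxTotal.
func backoffSchedule(initial, maxTotal time.Duration) []time.Duration {
	if initial <= 0 {
		// A non-positive interval would never grow, so return no retries.
		return nil
	}
	var schedule []time.Duration
	var total time.Duration
	for interval := initial; total+interval <= maxTotal; interval *= 2 {
		schedule = append(schedule, interval)
		total += interval
	}
	return schedule
}

func main() {
	var total time.Duration
	for i, wait := range backoffSchedule(time.Second, 31*time.Second) {
		total += wait
		fmt.Printf("retry %d: wait %v (cumulative %v)\n", i+1, wait, total)
	}
}
```

Any jitter added on top of each interval would push the real total past the budget, so the budget here bounds only the deterministic part of the wait.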
Force-pushed from 4f7e5cf to 0b08876.
Addresses @jahkeup's comments.
Couple nits, but otherwise looks great!! 🐻
Force-pushed from 0b08876 to 788182e.
Addresses @jahkeup's comments.
Some thoughts:
It's useful for In the context of a prolonged outage, a retry pattern like
Force-pushed from 788182e to 8624071.
Rebase develop.
@bcressey I think we should lift some of the handling into the unit: we can modify the RestartSec directive to specify a larger period of time to allow for retries. @etungsten we can also use a well-known returnable error (with var retriesError = ... or similar) to detect a retry error and sleep before exiting to add that jitter.
Either way, I think to make this reliable we'll also want to classify the errors we handle so that transient failures can be distinguished (where possible) from permanent failures (e.g., the user specified a bad image). This can be a fast-follow item IMO. I think this already puts us in a better position with respect to transient errors.
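As a rough illustration of the "well-known returnable error" idea, the sketch below defines a package-level sentinel and checks for it with `errors.Is`. This is an assumption-laden example, not the PR's implementation: the name `retriesError` comes from the comment, but its value, the helpers `pullWithRetries` and `shouldDelayExit`, and the simulated failure are all hypothetical. It also uses the standard library's `%w` wrapping rather than `errors.Wrap` from github.com/pkg/errors, which the diff uses, and it omits the sleep between attempts for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// retriesError is a sentinel that callers can match on. The message is an
// assumption; the review comment only suggests "var retriesError = ...".
var retriesError = errors.New("retries exhausted")

// errSimulatedPull stands in for a failed image pull in this sketch.
var errSimulatedPull = errors.New("registry unreachable")

// pullWithRetries (hypothetical) calls pull up to attempts times and wraps
// the last failure with retriesError once every attempt has failed.
func pullWithRetries(attempts int, pull func() error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = pull(); lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("%w: last error: %v", retriesError, lastErr)
}

// shouldDelayExit (hypothetical) reports whether err means the retry budget
// ran out, in which case the caller could sleep before exiting.
func shouldDelayExit(err error) bool {
	return errors.Is(err, retriesError)
}

func main() {
	err := pullWithRetries(3, func() error { return errSimulatedPull })
	if shouldDelayExit(err) {
		fmt.Println("retries exhausted; sleep before exiting:", err)
	}
}
```

Matching on a sentinel only identifies that retries ran out; it does not by itself tell a transient failure from a permanent one, which is the follow-up item raised above.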
It seems to me that there isn't a good way to match on types of errors to determine if a failure was transient or not.
Force-pushed from 8624071 to cef01ec.
Yeah, but I think to address @bcressey's concerns, we'd want to increase that and/or add some delay to the to-be-errored exit to make
Force-pushed from b352492 to 03eb767.
Force-pushed from 03eb767 to 7e80672.
Increases
I think the current code paths are okay even with the small comments I have, if you have a chance to make these changes now please do. If we need to move on, please track em to fix! 👍
Implements exponential backoff for image pulls. Also moves the signal-handling channel setup to after the image pull, since it's mostly relevant only when we're starting a container task.
Force-pushed from 7e80672 to c52ba69.
Addresses @jahkeup's comments.
⛵ 💾 👍
return nil, errors.Wrap(err, "retries exhausted")
}
// Add a random jitter between 2 - 6 seconds to the retry interval
retryIntervalWithJitter := retryInterval + time.Duration(rand.Int31n(jitterPeakAmplitude))*time.Millisecond + jitterLowerBound*time.Millisecond
nit: this should use types in the const where possible, but is small enough in scope that this isn't needed at the moment. Let's carry on, let's get this on a host!
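For reference, one way the typed-constant nit could look is sketched below. It is only an illustration under assumptions: the constant values (2 seconds and 4 seconds) are inferred from the "2 - 6 seconds" code comment rather than taken from the PR, and `withJitter` is a hypothetical helper. Declaring the constants as `time.Duration` removes the `*time.Millisecond` conversions at the call site.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Typed constants: both are time.Duration because they are built from
// time.Second. The values are assumptions based on the "2 - 6 seconds" comment.
const (
	jitterLowerBound    = 2 * time.Second
	jitterPeakAmplitude = 4 * time.Second
)

// withJitter (hypothetical) adds a random delay in the half-open range
// [jitterLowerBound, jitterLowerBound+jitterPeakAmplitude) to retryInterval.
func withJitter(retryInterval time.Duration) time.Duration {
	jitter := jitterLowerBound + time.Duration(rand.Int63n(int64(jitterPeakAmplitude)))
	return retryInterval + jitter
}

func main() {
	fmt.Println("next retry in", withJitter(time.Second))
}
```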
Issue #, if available: Implements #431
Description of changes:
go fmt
Testing done:
Locally testing host-ctr with bad/invalid image:
With good image:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.