template: wait for pod to teardown (if container is present) during delete #353
base: master
Conversation
Force-pushed from 8f8befa to 3e3d553.
Force-pushed from 3e3d553 to 37218ef.
Created a better version of this:
I didn't get a chance to test this yet.
The structure is better, but I think the fundamental problem stays (more inline).
Plus, I caught myself wondering - what problem does this actually solve? If we wait for a pod to be deleted, what good is waiting for its containers to terminate? Can you describe the problem that this PR would prevent?
ci-operator would wait for only 300 seconds. If teardown didn't finish by that time, the pod would be removed: leftover artifacts (usually Route53 records) would remain and cause issues on the next retest.
/hold I can't come up with a way to test this yet.
/cc @stevekuznetsov
I think I'm missing something here -- the pod actually being deleted and gone from the API server is a stronger requirement than the teardown
container inside of it being terminated. Why are we making this change?
When a test gets cancelled - for instance, when a new commit is pushed in rehearse tests - ci-operator would send a termination signal and wait for the pod to be gone for 5 minutes. In most install tests, teardown plus artifacts take longer than 5 minutes. This change would wait longer if the pod has a teardown container. See also #353 (comment)
OK, makes sense. Why not start a watch?
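For reference, a minimal sketch of what a watch-based wait for deletion might look like, assuming the same vendored client-go/apimachinery packages the file already imports (`coreclientset`, `meta`, `watch`) plus `k8s.io/apimachinery/pkg/fields`; `waitForPodGone` is a hypothetical name, not code from this PR:

```go
// Sketch only: wait for a pod to disappear using a watch instead of a poll.
func waitForPodGone(podClient coreclientset.PodInterface, name string, timeout time.Duration) error {
	watcher, err := podClient.Watch(meta.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("metadata.name", name).String(),
	})
	if err != nil {
		return fmt.Errorf("could not watch pod %s: %v", name, err)
	}
	defer watcher.Stop()
	// A Deleted event means the pod is actually gone from the API server.
	_, err = watch.Until(timeout, watcher, func(event watch.Event) (bool, error) {
		return event.Type == watch.Deleted, nil
	})
	return err
}
```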
Force-pushed from abb16d4 to 3448245.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: vrutkovs. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Force-pushed from 3448245 to 3211bf0.
Reworked this to leverage polls and watches:
Force-pushed from 54c3b8a to 4993c4a.
Force-pushed from 71b5c95 to 1f2f01f.
Force-pushed from 1f2f01f to b879a2a.
Force-pushed from b879a2a to 6adf813.
LGTM, let's give @stevekuznetsov a chance to review.
Can we add unit tests for these functions?
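For example, something along these lines with the fake clientset could cover the already-deleted case; only `waitForPodDeletion` comes from the diff, while the test name, namespace, pod name, and UID are made up, and the imports (`k8s.io/client-go/kubernetes/fake`, `k8s.io/apimachinery/pkg/types`) are assumed to be vendored:

```go
// Sketch of a unit test using the fake clientset.
func TestWaitForPodDeletionAlreadyGone(t *testing.T) {
	client := fake.NewSimpleClientset() // no pods registered
	podClient := client.CoreV1().Pods("test-namespace")
	// A pod that no longer exists should be treated as already deleted.
	if err := waitForPodDeletion(podClient, "e2e-aws", types.UID("123")); err != nil {
		t.Errorf("expected no error for a missing pod, got: %v", err)
	}
}
```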
pkg/steps/template.go
Outdated
	return nil
}

// Check that pod with this name exists and has the same UID
this is misleading
pkg/steps/template.go
Outdated
	time.Sleep(2 * time.Second)

	for _, status := range append(append([]coreapi.ContainerStatus{}, pod.Status.InitContainerStatuses...), pod.Status.ContainerStatuses...) {
		if status.Name == "teardown" && status.State.Terminated != nil {
Is teardown ever an initcontainer?
pkg/steps/template.go
Outdated
	timeout := 5 * time.Minute

	log.Printf("Waiting for pod %s to complete teardown ...", name)
	wait.Poll(10*time.Second, timeout, func() (done bool, err error) {
We used to poll every 2s -- why change?
pkg/steps/template.go
Outdated
func waitForPodDeletion(podClient coreclientset.PodInterface, name string, uid types.UID) error {
	timeout := 5 * time.Minute
	pod, err := checkPodExistsAndValid(podClient, name, uid)
	if err != nil || pod == nil {
In the case that err == nil but pod == nil, why do you return a nil err here? Please leave a comment.
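For illustration, the convention could be spelled out in the helper itself. This is a sketch reconstructed only from the snippets quoted in this review, not the PR's actual implementation; `kerrors` is assumed to be `k8s.io/apimachinery/pkg/api/errors`:

```go
// checkPodExistsAndValid returns (nil, nil) when there is nothing left to wait for.
func checkPodExistsAndValid(podClient coreclientset.PodInterface, name string, uid types.UID) (*coreapi.Pod, error) {
	pod, err := podClient.Get(name, meta.GetOptions{})
	if kerrors.IsNotFound(err) {
		// The pod is already gone, so "no pod" is a success, not an error.
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	if pod.UID != uid {
		// Same name, different UID: the original pod was deleted and replaced.
		return nil, nil
	}
	return pod, nil
}
```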
pkg/steps/template.go
Outdated
	return pod, nil
}

func waitForPodDeletion(podClient coreclientset.PodInterface, name string, uid types.UID) error {
waitForPodDeletion is no longer valid as a name -- were all callers expecting this new behavior?
pkg/steps/template.go
Outdated
		}
	}

	watcher, err := podClient.Watch(meta.ListOptions{
If you're setting up a watch, why not just use it for all of the interaction? Why the poll?
I don't think container termination status can be watched, can it?
Why not? You'd get any changes to PodStatus if I understand Watches correctly
(watch the Pod, not the container)
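In other words, a single pod watch already carries the container statuses in each event. A sketch of a condition function covering both cases (it mirrors the snippets quoted above and assumes `watcher` and `timeout` are already set up; it is not the PR's final code):

```go
_, err = watch.Until(timeout, watcher, func(event watch.Event) (bool, error) {
	if event.Type == watch.Deleted {
		return true, nil // the pod itself is gone
	}
	pod, ok := event.Object.(*coreapi.Pod)
	if !ok {
		return false, nil
	}
	// Added/Modified events include the full PodStatus, so container
	// termination is observable from a plain pod watch.
	for _, status := range pod.Status.ContainerStatuses {
		if status.Name == "teardown" && status.State.Terminated != nil {
			return true, nil
		}
	}
	return false, nil
})
```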
pkg/steps/template.go
Outdated
	return fmt.Errorf("waited for pod %s deletion for %ds, was not deleted", name, timeout)
	log.Printf("Waiting for pod %s to be deleted in %d seconds", name, timeout)
	_, err = watch.Until(timeout, watcher, func(event watch.Event) (done bool, err error) {
Why a 5 min watch for deletion after a 5 min retry on the container step?
Artifact uploads also take time to complete.
pkg/steps/template.go
Outdated
	for _, status := range append(append([]coreapi.ContainerStatus{}, pod.Status.InitContainerStatuses...), pod.Status.ContainerStatuses...) {
		names = append(names, status.Name)
	}
	sort.Strings(names)
Why?
pkg/steps/template.go
Outdated
	// Attempts to wait for teardown to complete
	containerNames := podContainerNames(pod)
	if sort.SearchStrings(containerNames, "teardown") < len(containerNames) {
nit: I like if sets.NewString(containerNames).Has("teardown") a lot more than these types of manipulations
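Roughly like this, assuming k8s.io/apimachinery/pkg/util/sets is vendored (note that the slice needs to be expanded with `...` for this to compile):

```go
containerNames := podContainerNames(pod)
if sets.NewString(containerNames...).Has("teardown") {
	// wait for the teardown container to terminate before waiting for deletion
}
```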
Force-pushed from f758b7d to 697f542.
Simplified this:
pkg/steps/template.go
Outdated
		// pod was deleted
		return true, nil
	case watch.Added, watch.Modified:
		if hasTeardownContainer {
I don't understand this logic. If we have a teardown container, we will exit out early every time. If we don't have a teardown container, we set this boolean to false. The comment says that will avoid re-checking, but in reality that means we check every time. Then, if the teardown container is terminated, you signal you are done, so the watch ends. I think we just need a dead-simple watch, or two watches. If you want to have one watch with a variable timeout, just wait for the deletion. If you want to wait for the teardown container completion and the pod deletion separately, you will want separate watches.
In general, if the issue was a too-short timeout that cut the teardown container short, why not just make this watch go on for an hour? In what cases do we not want to wait for the Pod to really be gone?
Also, if you look at the implementations in the build utils, we want a list then a watch with retries to handle transient errors.
> If we have a teardown container, we will exit out early every time

Fixed by introducing a teardownFinished var.

> If you want to wait for the teardown container completion and the pod deletion separately, you will want separate watches.

That was my initial idea (see f758b7d), however there is a short window (between the teardown watch and the pod-deletion watch) where the pod may be destroyed and replaced with a new pod. Two watches don't seem reliable to me.

> why not just make this watch go on for an hour? In what cases do we not want to wait for the Pod to really be gone?

That would hide potential issues in teardown.

> we want a list then a watch with retries to handle transient errors.

Using event, ok := <-watcher.ResultChan()? It doesn't seem to have some kind of timeout.
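For what it's worth, a raw ResultChan read can still be bounded by wrapping it in a select, similar in spirit to what watch.Until provides; this is only a sketch of a loop body inside a function that returns an error, with `watcher`, `timeout`, and `name` assumed to be in scope:

```go
deadline := time.After(timeout)
for {
	select {
	case event, ok := <-watcher.ResultChan():
		if !ok {
			return fmt.Errorf("watch closed before pod %s was deleted", name)
		}
		if event.Type == watch.Deleted {
			return nil // pod is gone
		}
	case <-deadline:
		return fmt.Errorf("waited %v for pod %s to be deleted", timeout, name)
	}
}
```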
…elete Pod teardown may take longer than 5 mins (the default ci-operator timeout). This commit would ensure the timeout is extended to wait for the teardown container to complete. This is useful for rehearse jobs, which reuse the namespace when testing a new commit.
Force-pushed from 697f542 to 2a10196.
Which? Before we spend more time working on an implementation, can we determine why the (stupid simple) approach of doing more retries over a 10, 20, 30 minute period would not be appropriate? Do we have some SLA for teardown time?
Extending the timeout is the simplest approach, and it's valid; however, it would apply to all ci-operator pods. e2e-aws's teardown is the only one I know of which takes longer than 5 mins at the moment; other types of tests may rely on the existing timeout. This PR is just one possible way, of course. If it looks overcomplicated then let's just bump the timeout on teardown to fix rehearse failures at least.
Of course it would hit all pods, but we poll every 2 seconds right now, so the only case where increasing the timeout would actually increase the time taken for the test to run is if the pod is not returning within the current timeout, and then it would only increase it by the time taken to finish tearing down, right?
Right, it appears a larger timeout would act the same. Created #358 instead. I'll keep this open for now.
Pod teardown may take longer than 5 mins (default ci-operator timeout).
This commit would ensure the same timeout is applied to the teardown
container - and then applied to the pod again.
This is useful for rehearse jobs, which reuse the namespace when testing
a new commit.
TODO:
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1707486