This repository has been archived by the owner on Jun 14, 2019. It is now read-only.

template: wait for pod to teardown (if container is present) during delete #353

Open
vrutkovs wants to merge 1 commit into master from wait-for-pod-to-teardown

Conversation

@vrutkovs (Member) commented May 27, 2019

Pod teardown may take longer than 5 minutes (the default ci-operator timeout).
This commit ensures the same timeout is applied to the teardown container, and
then applied to the pod again.

This is useful for rehearse jobs, which reuse the namespace when testing
a new commit.

TODO:

  • Add tests
  • Avoid using two watchers? Extend timeout to 10 mins if there is a teardown container?

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1707486

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 27, 2019
@vrutkovs vrutkovs force-pushed the wait-for-pod-to-teardown branch from 8f8befa to 3e3d553 on May 28, 2019 15:12
@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 28, 2019
@vrutkovs vrutkovs force-pushed the wait-for-pod-to-teardown branch from 3e3d553 to 37218ef on May 29, 2019 17:43
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 29, 2019
@vrutkovs (Member, Author)

Created a better version of this:

  1. Added a few helper functions
  2. Before the loop, wait for teardown to complete (if that container exists in the pod)
  3. Wait for the pod to be deleted

I didn't get a chance to test this yet

@petr-muller (Member) left a comment

The structure is better, but I think the fundamental problem stays (more inline).

Plus, I caught myself wondering - what problem does this actually solve? If we wait for a pod to be deleted, what good is waiting for its containers to terminate? Can you describe the problem that this PR would prevent?

@vrutkovs (Member, Author) commented Jun 3, 2019

what problem does this actually solve? If we wait for a pod to be deleted, what good is waiting for its containers to terminate?

ci-operator would only wait for 300 seconds. If teardown didn't finish by that time, the pod would be removed: leftover artifacts (usually Route53 records) would remain and cause issues on the next retest.

@vrutkovs (Member, Author) commented Jun 4, 2019

/hold

I can't come up with a way to test this yet

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 4, 2019
@vrutkovs (Member, Author) commented Jun 7, 2019

/cc @stevekuznetsov

@stevekuznetsov (Contributor) left a comment

I think I'm missing something here -- the pod actually being deleted and gone from the API server is a stronger requirement than the teardown container inside of it being terminated. Why are we making this change?

@vrutkovs (Member, Author) commented Jun 8, 2019

Why are we making this change?

When a test gets cancelled - for instance, when a new commit is pushed in rehearse tests - ci-operator sends a termination signal and waits 5 minutes for the pod to be gone. In most install tests, teardown plus artifact gathering takes longer than 5 minutes.

This change waits longer only if the pod has a teardown container, so it won't affect unit tests, release jobs, etc. It first waits 5 minutes for teardown to complete and then 5 minutes for the pod to disappear (in most installs, teardown completes in 5-7 minutes).
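
In code terms, the condition being waited on is roughly the check below. This is a minimal sketch, reusing the coreapi alias for k8s.io/api/core/v1 seen elsewhere in this PR; the helper name teardownTerminated is illustrative and not necessarily what the PR uses.

package steps

import (
	coreapi "k8s.io/api/core/v1"
)

// teardownTerminated reports whether the pod has a container named "teardown"
// that has reached a Terminated state. Init container statuses are scanned as
// well, mirroring the diff excerpts discussed later in this review.
func teardownTerminated(pod *coreapi.Pod) bool {
	statuses := append([]coreapi.ContainerStatus{}, pod.Status.InitContainerStatuses...)
	statuses = append(statuses, pod.Status.ContainerStatuses...)
	for _, status := range statuses {
		if status.Name == "teardown" && status.State.Terminated != nil {
			return true
		}
	}
	return false
}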

See also #353 (comment)

@stevekuznetsov (Contributor)

OK, makes sense. Why not start a watch?

@vrutkovs vrutkovs force-pushed the wait-for-pod-to-teardown branch from abb16d4 to 3448245 on June 10, 2019 09:03
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vrutkovs
To complete the pull request process, please assign stevekuznetsov
You can assign the PR to them by writing /assign @stevekuznetsov in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vrutkovs vrutkovs force-pushed the wait-for-pod-to-teardown branch from 3448245 to 3211bf0 on June 10, 2019 09:08
@vrutkovs (Member, Author)

Reworked this to leverage polls and watches (rough sketch of the polling helper below):

  • waitForTeardownToComplete polls the pod's list of containers
  • waitForPodDeletion watches for the pod to be removed, with a timeout
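
A minimal sketch of what the polling helper could look like, assuming client-go's typed core/v1 PodInterface and the pre-context Get signature in use at the time; the 2-second interval and the error handling are illustrative only:

package steps

import (
	"log"
	"time"

	coreapi "k8s.io/api/core/v1"
	meta "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	coreclientset "k8s.io/client-go/kubernetes/typed/core/v1"
)

// waitForTeardownToComplete polls the pod until its "teardown" container
// reports a Terminated state, or the timeout expires. Sketch only.
func waitForTeardownToComplete(podClient coreclientset.PodInterface, name string) error {
	timeout := 5 * time.Minute
	log.Printf("Waiting for pod %s to complete teardown ...", name)
	return wait.Poll(2*time.Second, timeout, func() (bool, error) {
		pod, err := podClient.Get(name, meta.GetOptions{})
		if err != nil {
			// Swallow transient errors; a real implementation should treat
			// NotFound (pod already gone) as success and fail on the rest.
			return false, nil
		}
		for _, status := range append(append([]coreapi.ContainerStatus{}, pod.Status.InitContainerStatuses...), pod.Status.ContainerStatuses...) {
			if status.Name == "teardown" && status.State.Terminated != nil {
				return true, nil
			}
		}
		return false, nil
	})
}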

@vrutkovs vrutkovs force-pushed the wait-for-pod-to-teardown branch 2 times, most recently from 54c3b8a to 4993c4a on June 10, 2019 09:32
@vrutkovs vrutkovs force-pushed the wait-for-pod-to-teardown branch 2 times, most recently from 71b5c95 to 1f2f01f on June 10, 2019 11:52
@vrutkovs vrutkovs force-pushed the wait-for-pod-to-teardown branch from 1f2f01f to b879a2a on June 10, 2019 12:41
@vrutkovs vrutkovs force-pushed the wait-for-pod-to-teardown branch from b879a2a to 6adf813 on June 10, 2019 13:33
@petr-muller (Member)

LGTM, let's give @stevekuznetsov a chance to review

@stevekuznetsov (Contributor) left a comment

Can we add unit tests for these functions?

return nil
}

// Check that pod with this name exists and has the same UID
Contributor:
this is misleading

time.Sleep(2 * time.Second)

for _, status := range append(append([]coreapi.ContainerStatus{}, pod.Status.InitContainerStatuses...), pod.Status.ContainerStatuses...) {
if status.Name == "teardown" && status.State.Terminated != nil {
Contributor:
Is teardown ever an initcontainer?

timeout := 5 * time.Minute

log.Printf("Waiting for pod %s to complete teardown ...", name)
wait.Poll(10*time.Second, timeout, func() (done bool, err error) {
Contributor:
We used to poll every 2s -- why change?

func waitForPodDeletion(podClient coreclientset.PodInterface, name string, uid types.UID) error {
timeout := 5 * time.Minute
pod, err := checkPodExistsAndValid(podClient, name, uid)
if err != nil || pod == nil {
Contributor:
In the case that err == nil but pod == nil, why do you return a nil err here? Please leave a comment.

return pod, nil
}

func waitForPodDeletion(podClient coreclientset.PodInterface, name string, uid types.UID) error {
Contributor:
waitForPodDeletion is no longer valid as a name -- were all callers expecting this new behavior?

}
}

watcher, err := podClient.Watch(meta.ListOptions{
Contributor:
If you're setting up a watch, why not just use it for all of the interaction? Why the poll?

Member Author:
I don't think container terminate status can be watched, can it?

Contributor:
Why not? You'd get any changes to PodStatus if I understand Watches correctly

Contributor:
(watch the Pod, not the container)


return fmt.Errorf("waited for pod %s deletion for %ds, was not deleted", name, timeout)
log.Printf("Waiting for pod %s to be deleted in %d seconds", name, timeout)
_, err = watch.Until(timeout, watcher, func(event watch.Event) (done bool, err error) {
Contributor:
Why a 5-minute watch for deletion after a 5-minute retry on the container step?

Member Author:
Artifacts upload also takes time to complete.

for _, status := range append(append([]coreapi.ContainerStatus{}, pod.Status.InitContainerStatuses...), pod.Status.ContainerStatuses...) {
names = append(names, status.Name)
}
sort.Strings(names)
Contributor:
Why?


// Attempts to wait for teardown to complete
containerNames := podContainerNames(pod)
if sort.SearchStrings(containerNames, "teardown") < len(containerNames) {
Contributor:
nit: I like if sets.NewString(containerNames).Has("teardown") a lot more than these types of manipulations
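
For reference, the suggested membership check would look roughly like this (sets.NewString is variadic, so the slice has to be expanded; the helper name is hypothetical):

package steps

import "k8s.io/apimachinery/pkg/util/sets"

// hasTeardownContainer applies the reviewer's suggestion: a set membership
// check instead of sort.SearchStrings over a sorted slice.
func hasTeardownContainer(containerNames []string) bool {
	return sets.NewString(containerNames...).Has("teardown")
}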

@vrutkovs vrutkovs force-pushed the wait-for-pod-to-teardown branch 2 times, most recently from f758b7d to 697f542 on June 11, 2019 11:33
@vrutkovs (Member, Author)

Simplified this:

  • If the pod has a teardown container, the timeout is extended by 10 mins (average teardown takes 10 min on AWS, but might take longer on OpenStack)
  • waitForPodDeletion now uses a watcher instead of polling (rough sketch below).
    Should it also print a message when teardown has completed?
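
A rough sketch of the watch-based variant described above; it assumes the deprecated watch.Until form already used in this diff, a field selector on the pod name, and a plain boolean for the teardown case, so parameter names and details are illustrative:

package steps

import (
	"fmt"
	"log"
	"time"

	meta "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/watch"
	coreclientset "k8s.io/client-go/kubernetes/typed/core/v1"
)

// waitForPodDeletion watches the named pod until a Deleted event arrives.
// If the pod carries a teardown container, the timeout is extended so that
// teardown (and artifact upload) has time to finish. Sketch only.
func waitForPodDeletion(podClient coreclientset.PodInterface, name string, hasTeardownContainer bool) error {
	timeout := 5 * time.Minute
	if hasTeardownContainer {
		timeout += 10 * time.Minute
	}
	watcher, err := podClient.Watch(meta.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("metadata.name", name).String(),
	})
	if err != nil {
		return fmt.Errorf("could not watch pod %s: %v", name, err)
	}
	defer watcher.Stop()

	log.Printf("Waiting up to %s for pod %s to be deleted ...", timeout, name)
	_, err = watch.Until(timeout, watcher, func(event watch.Event) (bool, error) {
		switch event.Type {
		case watch.Deleted:
			// The pod is gone from the API server.
			return true, nil
		case watch.Error:
			return false, fmt.Errorf("error while watching pod %s", name)
		default:
			// Added/Modified: the pod is still running or terminating.
			return false, nil
		}
	})
	if err != nil {
		return fmt.Errorf("pod %s was not deleted within %s: %v", name, timeout, err)
	}
	return nil
}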

// pod was deleted
return true, nil
case watch.Added, watch.Modified:
if hasTeardownContainer {
Contributor:
I don't understand this logic. If we have a teardown container, we will exit out early every time. If we don't have a teardown container, we set this boolean to false. The comment says that will avoid re-checking, but in reality that means we check every time. Then, if the teardown container is terminated, you signal you are done, so the watch ends. I think we just need a dead-simple watch, or two watches. If you want to have one watch with a variable timeout, just wait for the deletion. If you want to wait for the teardown container completion and the pod deletion separately, you will want separate watches.

In general, if the issue was a too-short timeout that cut the teardown container short, why not just make this watch go on for an hour? In what cases do we not want to wait for the Pod to really be gone?

Also, if you look at the implementations in the build utils, we want a list and then a watch with retries to handle transient errors.

Member Author:
If we have a teardown container, we will exit out early every time

Fixed by introducing teardownFinished var

If you want to wait for the teardown container completion and the pod deletion separately, you will want separate watches.

That was my initial idea (see f758b7d), however there is a short window (between the teardown watch and the pod-deletion watch) where the pod may be destroyed and replaced with a new pod. Two watches don't seem reliable to me.

why not just make this watch go on for an hour? In what cases do we not want to wait for the Pod to really be gone?

That would hide potential issues in teardown

we want a list then a watch with retires to handle transient errors.

Using event, ok := <-watcher.ResultChan()? It doesn't seem to have any kind of timeout.

template: wait for pod to teardown (if container is present) during delete

Pod teardown may take longer than 5 mins (default ci-operator timeout).
This commit would ensure the timeout is extended to wait for teardown
container to complete

This is useful for rehearse jobs, which reuse the namespace when testing
a new commit
@vrutkovs vrutkovs force-pushed the wait-for-pod-to-teardown branch from 697f542 to 2a10196 on June 12, 2019 08:33
@stevekuznetsov (Contributor)

That would hide potential issues in teardown

Which? Before we spend more time working on an implementation can we determine why the (stupid simple) approach of doing more retries over a 10, 20, 30 minute period would not be appropriate? Do we have some SLA for teardown time?

@vrutkovs (Member, Author)

Extending the timeout is the simplest approach, and it's valid, however it would apply to all ci-operator pods.

e2e-aws's teardown is the only one I know of which takes longer than 5 minutes at the moment; other types of tests may rely on the existing timeout.

This PR is just one possible way, of course. If it looks overcomplicated then let's just bump the timeout on teardown to fix rehearse failures at least.

@stevekuznetsov (Contributor)

Of course it would hit all pods, but we poll every 2 seconds right now, so the only case where increasing the timeout would actually increase the time taken for the test to run is if the pod is not returning within the current timeout, and then it would only increase it by the time taken to finish tearing down, right?

@vrutkovs (Member, Author)

Right, it appears a larger timeout would act the same. Created #358 instead.

I'll keep this open for now

Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
4 participants