[WIP] Add timeout for container/sandbox recover. #884
Signed-off-by: Lantao Liu <[email protected]>
We should cherry-pick this into all supported branches.
LGTM
/LGTM
LGTM. One question: if the load request times out, how is the container going to be represented in the CRI plugin? Does CRI show the existence of the container, and what state will it be in?
@yujuhong Currently, it won't show the container, but I think it would be better to show the container in the Unknown state. Let me see whether showing a bad container will cause any issues. BTW, how does the kubelet deal with the unknown state? I remember it triggers a sync pod, and the kubelet will try to start a new one.
Yes, kubelet will try to kill the container in the unknown state and start a new one.
I found this PR doesn't actually help, because:
Given that, I'll leave this PR here for now... We need a better timeout solution to make the system more reliable. I filed an issue: containerd/containerd#2578.
Although we don't want it to, containerd-shim sometimes hangs, e.g. containerd/containerd#2438. We should definitely root-cause and fix it.
On the other hand, we should make sure that the system is still functional when some containerd-shims hang. For example, ctr task ls will hang if a single containerd-shim hangs. --> We should probably fix it. @crosbymichael
Luckily, the CRI plugin in most cases doesn't handle multiple containers at a time, which makes sure that a single container failure won't block other containers.
However, the one case where the CRI plugin may handle multiple containers at a time is the recovery logic. This means that if a containerd-shim hangs, the CRI plugin won't be able to restart. This is super bad.
This PR adds a timeout to the per-container/sandbox recovery logic, so that even if one container hangs, we just don't load it and continue dealing with the other containers.
Signed-off-by: Lantao Liu [email protected]