
[WIP] Add timeout for container/sandbox recover. #884

Closed

Conversation

Random-Liu
Member

@Random-Liu Random-Liu commented Aug 24, 2018

Although we don't want it to, containerd-shim sometimes hangs, e.g. containerd/containerd#2438. We should definitely root-cause and fix it.

However, in the meantime we should make sure that the system stays functional when some containerd-shims hang. For example, `ctr task ls` will hang if a single containerd-shim hangs. --> We should probably fix that. @crosbymichael

Luckily, the CRI plugin in most cases doesn't handle multiple containers at a time, which ensures that a failure in a single container won't block other containers.

However, the one case where the CRI plugin does handle multiple containers at a time is the recovery logic. This means that if a containerd-shim hangs, the CRI plugin won't be able to restart. This is super bad.

This PR adds a per-container/sandbox timeout to the recovery logic, so that even if one container hangs, we simply skip loading it and continue with the other containers.
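Roughly, the idea is something like the sketch below (not the actual plugin code; `recoverContainers`, `loadContainer`, and the 10-second budget are placeholders for illustration):

```go
// Minimal sketch of the approach: give each container its own deadline during
// recovery, skip it on timeout, and keep going with the rest.
package main

import (
	"context"
	"log"
	"time"
)

const recoveryTimeout = 10 * time.Second // hypothetical per-container budget

// loadContainer stands in for the real per-container recovery work, which
// would talk to the containerd task service and must respect ctx.
func loadContainer(ctx context.Context, id string) error {
	select {
	case <-time.After(time.Second): // placeholder for the real load work
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func recoverContainers(ctx context.Context, ids []string) {
	for _, id := range ids {
		// Derive a per-container context so one hanging shim only burns
		// its own budget instead of blocking the whole recovery pass.
		cctx, cancel := context.WithTimeout(ctx, recoveryTimeout)
		err := loadContainer(cctx, id)
		cancel()
		if err != nil {
			// Don't fail recovery; just skip the bad container.
			log.Printf("failed to load container %q, skipping: %v", id, err)
			continue
		}
	}
}

func main() {
	recoverContainers(context.Background(), []string{"c1", "c2"})
}
```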

Signed-off-by: Lantao Liu [email protected]

@Random-Liu Random-Liu added this to the v1.0 milestone Aug 24, 2018
@Random-Liu
Member Author

We should cherry-pick this into all supported branches.

@crosbymichael
Member

LGTM

Member

@mikebrow mikebrow left a comment


/LGTM

@yujuhong
Member

LGTM

One question: if the load request times out, how will the container be represented in the CRI plugin -- does CRI show the existence of the container, and what state will it be in?

@Random-Liu
Member Author

@yujuhong Currently we won't show the container, but I think it would be better to show it in the Unknown state. Let me see whether showing a bad container will cause any issues.

BTW, how does kubelet deal with the unknown state? I remember it triggers a pod sync, and kubelet will try to start a new one.

@yujuhong
Member

Yes, kubelet will try to kill the container in the unknown state and start a new one.

@Random-Liu
Member Author

Random-Liu commented Aug 25, 2018

I found this PR doesn't actually help, because:

  1. There is no gRPC boundary between the CRI plugin and containerd, so nothing handles the timeout;
  2. The context will eventually be passed to containerd-shim, and ideally the RPC client would handle the timeout. However, ttrpc doesn't support timeouts yet (ttrpc: end to end timeout support ttrpc#3).

Given that, I'll leave this PR as-is for now... We need a better timeout solution to make the system more reliable. I filed an issue: containerd/containerd#2578.
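To illustrate the limitation (a sketch with made-up names, not containerd or ttrpc code): a context deadline only helps if something on the call path actually checks it, so the best a caller can do today is abandon the blocking call, which leaks a goroutine and is damage control rather than a real fix:

```go
// hangingShimCall stands in for a task-service call whose client never
// observes the context deadline (an assumption for illustration).
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

func hangingShimCall() error {
	time.Sleep(time.Hour) // hangs "forever"
	return nil
}

// callWithTimeout is a client-side workaround sketch: run the blocking call
// in a goroutine and give up when the context expires. The goroutine is
// leaked, which is why this is not a proper timeout solution.
func callWithTimeout(ctx context.Context, f func() error) error {
	done := make(chan error, 1)
	go func() { done <- f() }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("call abandoned: %w", ctx.Err())
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	err := callWithTimeout(ctx, hangingShimCall)
	fmt.Println(errors.Is(err, context.DeadlineExceeded), err) // true, call abandoned: ...
}
```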

@Random-Liu Random-Liu changed the title Add timeout for container/sandbox recover. [WIP] Add timeout for container/sandbox recover. Aug 25, 2018
@Random-Liu Random-Liu removed this from the v1.0 milestone Aug 25, 2018
@Random-Liu Random-Liu closed this Sep 10, 2018
@Random-Liu Random-Liu deleted the add-timeout-for-recover branch September 10, 2018 18:48
6 participants