Container errors on restart #3352

Closed
3 tasks done
apostasie opened this issue Aug 23, 2024 · 2 comments
Labels
bug Something isn't working priority/high

Comments

@apostasie
Contributor

apostasie commented Aug 23, 2024

Description

This is a variant of #3350

Containers cannot be restarted after being shut down by containerd stopping, and are generally left in a broken state.

This is against containerd v1.7 (unlike #3350, which was tested against containerd v2).

Steps to reproduce the issue

Reproduction is:

nerdctl rm -f foo
nerdctl run -d --name foo debian sleep Inf
systemctl --user stop containerd
systemctl --user start containerd

Then

nerdctl start foo

or

nerdctl stop foo

Describe the results you received and expected

There are clearly multiple issues.

First is:

  • inability of the container to re-acquire its name in the name store

This issue affects only main (and not 1.7)
I have a local patch for this that I will send shortly.
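
To illustrate the first issue, here is a minimal Go sketch of a name store in which acquiring a name is idempotent for the container that already holds it. `NameStore`, `Acquire`, and `ErrNameTaken` are hypothetical names made up for this example; this is not nerdctl's actual namestore code, nor the patch mentioned above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNameTaken is a hypothetical sentinel error used only in this sketch.
var ErrNameTaken = errors.New("name is already used by another container")

// NameStore is a hypothetical in-memory stand-in for a container name store.
type NameStore struct {
	mu     sync.Mutex
	owners map[string]string // name -> container ID
}

func NewNameStore() *NameStore {
	return &NameStore{owners: make(map[string]string)}
}

// Acquire reserves name for id. It succeeds when the name is free or is
// already held by the same id, and fails only when another id holds it.
func (s *NameStore) Acquire(name, id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if owner, ok := s.owners[name]; ok && owner != id {
		return fmt.Errorf("%w: %q is held by %s", ErrNameTaken, name, owner)
	}
	s.owners[name] = id
	return nil
}

func main() {
	s := NewNameStore()
	fmt.Println(s.Acquire("foo", "ec2a02d4")) // initial run
	fmt.Println(s.Acquire("foo", "ec2a02d4")) // same container started again
	fmt.Println(s.Acquire("foo", "0badc0de")) // a different container
}
```

Only the third call returns an error, since the name is held by a different container ID.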

Second is:

  • bridge plugin refusing to return already allocated ip
level=fatal msg="failed to call cni.Setup: plugin type=\"bridge\" failed (add): failed to allocate for range 0: 10.4.0.229 has been allocated to default-ec2a02d4f734a18adf2292b4a5efbcb0d5e2581198ea54653c63bdde05bdc1f1, duplicate allocation is not allowed": unknown

This is definitely coming from https://github.com/containernetworking/plugins/blob/main/plugins/ipam/host-local/backend/allocator/allocator.go#L83

This has been there for some time and affects both 1.7 and main.

This needs discussion.
Should we modify the allocator over there and return the already allocated ip instead of failing?
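
As a basis for that discussion, here is a minimal Go sketch of the "return the already allocated IP" behaviour. `Allocator` and `Get` are hypothetical names for an in-memory model; this is not the host-local plugin code and says nothing about how its on-disk store works.

```go
package main

import "fmt"

// Allocator is a hypothetical in-memory IP allocator used only to
// illustrate the idea; it is not the host-local implementation.
type Allocator struct {
	pool  []string          // candidate addresses, tried in order
	byID  map[string]string // container ID -> reserved IP
	inUse map[string]bool   // IP -> reserved
}

func NewAllocator(pool []string) *Allocator {
	return &Allocator{
		pool:  pool,
		byID:  make(map[string]string),
		inUse: make(map[string]bool),
	}
}

// Get returns the IP already reserved for id, if any. Otherwise it
// reserves the first free address in the pool.
func (a *Allocator) Get(id string) (string, error) {
	if ip, ok := a.byID[id]; ok {
		return ip, nil // reuse the reservation instead of failing on a duplicate
	}
	for _, ip := range a.pool {
		if !a.inUse[ip] {
			a.inUse[ip] = true
			a.byID[id] = ip
			return ip, nil
		}
	}
	return "", fmt.Errorf("no IP addresses available for %s", id)
}

func main() {
	a := NewAllocator([]string{"10.4.0.229", "10.4.0.230"})
	first, _ := a.Get("default-ec2a02d4")
	again, _ := a.Get("default-ec2a02d4") // second setup for the same container
	fmt.Println(first, again)
}
```

In this model a repeated setup for the same ID is harmless. Whether the upstream allocator should behave this way is exactly the open question above.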

Third is:

  • if stop cannot find the container Task, it returns "container not found"
    This is probably widespread in our codebase, and other commands may also fail for the same reason.
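
One possible shape for handling the third issue, sketched in Go: treat a missing task as "already stopped" rather than surfacing it as a missing container. The `Container` interface, `ErrTaskNotFound`, and `Stop` are hypothetical and invented for this example; they do not mirror the containerd client API or nerdctl's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTaskNotFound is a hypothetical sentinel standing in for whatever
// "not found" error the runtime client reports for a missing task.
var ErrTaskNotFound = errors.New("task not found")

// Container is a hypothetical minimal interface for this sketch.
type Container interface {
	ID() string
	// KillTask stops the running task, or returns an error wrapping
	// ErrTaskNotFound when the container has no task.
	KillTask() error
}

// Stop treats a missing task as "already stopped" instead of reporting
// that the container itself does not exist.
func Stop(c Container) error {
	err := c.KillTask()
	if errors.Is(err, ErrTaskNotFound) {
		return nil // nothing is running, so there is nothing to stop
	}
	if err != nil {
		return fmt.Errorf("stopping container %s: %w", c.ID(), err)
	}
	return nil
}

// taskless simulates a container whose task disappeared when the
// daemon was restarted.
type taskless struct{ id string }

func (t taskless) ID() string { return t.id }

func (t taskless) KillTask() error {
	return fmt.Errorf("container %s: %w", t.id, ErrTaskNotFound)
}

func main() {
	fmt.Println(Stop(taskless{id: "foo"}))
}
```

The key point is that "the container exists but has no task" and "the container does not exist" are kept as separate conditions.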

Issues 2 and 3 might be related.

I'll look into these and figure out whether we can fix or work around them, then test with different network types, reboots, and containerd v2.

cc @AkihiroSuda we should flag this urgent - although this is apparently not new, this is a pretty bad set of issues.

What version of nerdctl are you using?

main

Are you using a variant of nerdctl? (e.g., Rancher Desktop)

None

Host information

Client:
 Namespace:	default
 Debug Mode:	false

Server:
 Server Version: v1.7.16
 Storage Driver: overlayfs
 Logging Driver: json-file
  Cgroup Driver:  : systemd
  Cgroup Version: : 2
 Plugins:
  Log:     fluentd journald json-file syslog
  Storage: native overlayfs stargz fuse-overlayfs
 Security Options:
  apparmor
  seccomp
   Profile:	builtin
  cgroupns
  rootless
 Kernel Version:   6.8.0-41-generic
 Operating System: Ubuntu 24.04 LTS
 OSType:           linux
 Architecture:     aarch64
 CPUs:             4
 Total Memory:     3.814GiB
 Name:             lima-default
 ID:               cd6896f4-2884-435e-b455-72137115b4fe

WARNING: AppArmor profile "nerdctl-default" is not loaded.
         Use 'sudo nerdctl apparmor load' if you prefer to use AppArmor with rootless mode.
         This warning is negligible if you do not intend to use AppArmor.
@apostasie
Contributor Author

The 3 issues preventing restart should be fixed with #3356

The bottom line is that when containerd restarts, calling start on previously running containers will actually make them go through onCreateRuntime again.

This is unexpected to me, as the normal stop/start flow does NOT do that, and is very likely unexpected to other contributors as well.

We should have a cold hard look at what is going on inside onCreateRuntime and make sure we account for the fact that it may be run multiple times for a single container without ever hitting onPostStop.
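
A minimal Go sketch of what "safe to run multiple times" could look like. The `Hooks` type and its methods are hypothetical and only model the lifecycle described here; they are not the actual nerdctl hook code.

```go
package main

import "fmt"

// Hooks is a hypothetical stand-in for per-container lifecycle hooks.
type Hooks struct {
	active    map[string]bool // container ID -> resources currently set up
	Setups    int
	Teardowns int
}

func NewHooks() *Hooks {
	return &Hooks{active: make(map[string]bool)}
}

// teardown releases the resources held for id, if there are any.
func (h *Hooks) teardown(id string) {
	if h.active[id] {
		delete(h.active, id)
		h.Teardowns++
	}
}

// OnCreateRuntime may run again without an intervening OnPostStop, so it
// first releases whatever a previous run left behind, then sets up anew.
func (h *Hooks) OnCreateRuntime(id string) {
	h.teardown(id)
	h.active[id] = true
	h.Setups++
}

// OnPostStop releases the resources held for id.
func (h *Hooks) OnPostStop(id string) {
	h.teardown(id)
}

func main() {
	h := NewHooks()
	h.OnCreateRuntime("foo") // initial run
	h.OnCreateRuntime("foo") // start after a daemon restart, no OnPostStop in between
	h.OnPostStop("foo")
	fmt.Println(h.Setups, h.Teardowns, len(h.active))
}
```

Cleaning up stale state at the start of the hook is one option; reusing the existing state is another. Which is right likely depends on the resource (name, IP, mounts).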

Another concerning issue is (unfixed) #3357 - which may be a runc issue. I cannot think of a simple workaround for it, and it will bite us again next time we have failures in onCreateRuntime.

@apostasie
Contributor Author

#3356 and #3362 addressed a slew of issues and make us more resilient to unexpected conditions.

I am going to close this, as I am now able to restart without errors.

However, one issue remains that will still break the namestore (#3357). Though problematic, it should not happen under normal conditions and possibly requires an upstream fix, so it should be addressed separately.
