Container errors on restart #3352

apostasie · 2024-08-23T19:03:08Z

Description

This is a variant of #3350

Containers cannot be restarted after being shut down by containerd stopping, and are generally left in a broken state.

This is against containerd v1.7 (unlike #3350, which was tested against containerd v2).

Steps to reproduce the issue

Reproduction is:

nerdctl rm -f foo
nerdctl run -d --name foo debian sleep Inf
systemctl --user stop containerd
systemctl --user start containerd

Then

nerdctl start foo

or

nerdctl stop foo

Describe the results you received and expected

There are clearly multiple issues.

First is:

inability of the container to re-acquire its name in the name store

This issue affects only main (and not 1.7).
I have a local patch for that, which I will send shortly.

Second is:

bridge plugin refusing to return already allocated ip

level=fatal msg="failed to call cni.Setup: plugin type=\"bridge\" failed (add): failed to allocate for range 0: 10.4.0.229 has been allocated to default-ec2a02d4f734a18adf2292b4a5efbcb0d5e2581198ea54653c63bdde05bdc1f1, duplicate allocation is not allowed": unknown

This is definitely coming from https://github.com/containernetworking/plugins/blob/main/plugins/ipam/host-local/backend/allocator/allocator.go#L83

This has been there for some time and affects both 1.7 and main.

This needs discussion.
Should we modify the allocator over there and return the already allocated ip instead of failing?

Third is:

if stop cannot find the container Task, it returns "container not found"
This is probably widespread in our codebase, and other commands may also fail for the same reason.

Issues 2 and 3 might be related.

I'll look into these and figure out if we can fix or work around them, then test with different network types, reboots, and also containerd v2.

cc @AkihiroSuda we should flag this urgent - although this is apparently not new, this is a pretty bad set of issues.

What version of nerdctl are you using?

main

Are you using a variant of nerdctl? (e.g., Rancher Desktop)

None

Host information

Client:
 Namespace:	default
 Debug Mode:	false

Server:
 Server Version: v1.7.16
 Storage Driver: overlayfs
 Logging Driver: json-file
  Cgroup Driver:  : systemd
  Cgroup Version: : 2
 Plugins:
  Log:     fluentd journald json-file syslog
  Storage: native overlayfs stargz fuse-overlayfs
 Security Options:
  apparmor
  seccomp
   Profile:	builtin
  cgroupns
  rootless
 Kernel Version:   6.8.0-41-generic
 Operating System: Ubuntu 24.04 LTS
 OSType:           linux
 Architecture:     aarch64
 CPUs:             4
 Total Memory:     3.814GiB
 Name:             lima-default
 ID:               cd6896f4-2884-435e-b455-72137115b4fe

WARNING: AppArmor profile "nerdctl-default" is not loaded.
         Use 'sudo nerdctl apparmor load' if you prefer to use AppArmor with rootless mode.
         This warning is negligible if you do not intend to use AppArmor.


apostasie · 2024-08-24T05:55:47Z

The 3 issues preventing restart should be fixed with #3356

The bottom line is that when containerd restarts, calling start on previously running containers will actually make them go through onCreateRuntime again.

This is unexpected for me - as the normal stop/start flow does NOT do that - and very likely unexpected for other contributors as well.

We should have a cold hard look at what is going on inside onCreateRuntime and make sure we account for the fact that it may be run multiple times for a single container without ever hitting onPostStop.
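As an illustration of "may be run multiple times", here is a rough sketch of a name reservation that tolerates being repeated for the same container. The `NameStore` type and its methods are hypothetical and much simpler than the real name store; the point is only that re-acquiring a name you already hold should succeed, while a different container asking for it should still fail.

```go
package main

import (
	"fmt"
	"sync"
)

// NameStore is a hypothetical in-memory name -> container ID registry.
// It is a simplified illustration, not nerdctl's actual name store.
type NameStore struct {
	mu    sync.Mutex
	names map[string]string
}

// NewNameStore returns an empty store.
func NewNameStore() *NameStore {
	return &NameStore{names: make(map[string]string)}
}

// Acquire reserves name for id. It is idempotent: acquiring a name
// already held by the same id succeeds, so a create hook that runs
// twice without an intervening release does not fail.
func (s *NameStore) Acquire(name, id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if holder, ok := s.names[name]; ok && holder != id {
		return fmt.Errorf("name %q is already used by ID %q", name, holder)
	}
	s.names[name] = id
	return nil
}

// Release frees name, but only if it is currently held by id.
func (s *NameStore) Release(name, id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.names[name] == id {
		delete(s.names, name)
	}
}

func main() {
	s := NewNameStore()
	fmt.Println(s.Acquire("foo", "id1")) // first create hook run
	fmt.Println(s.Acquire("foo", "id1")) // hook runs again after a containerd restart
	fmt.Println(s.Acquire("foo", "id2")) // a different container still conflicts
}
```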

Another concerning issue is (unfixed) #3357 - which may be a runc issue. I cannot think of a simple workaround for it, and it will bite us again next time we have failures in onCreateRuntime.

apostasie · 2024-08-25T00:00:05Z

#3356 and #3362 addressed a slew of issues and make us more resilient to unexpected conditions.

I am going to close this, as I am now able to restart without errors.

One issue remains that will still break the namestore (#3357). Though problematic, it should not happen under normal conditions and possibly requires an upstream fix, so it should be addressed separately.

apostasie added the kind/unconfirmed-bug-claim Unconfirmed bug claim label Aug 23, 2024

AkihiroSuda added bug Something isn't working priority/high and removed kind/unconfirmed-bug-claim Unconfirmed bug claim labels Aug 24, 2024

apostasie closed this as completed Aug 25, 2024
