[v2 regression] nerdctl start fails after restarting the host (msg="failed to lock state dir: unable to unmarshall lifecycle data: unexpected end of JSON input") #3350

Closed
AkihiroSuda opened this issue Aug 23, 2024 · 5 comments
Labels
bug Something isn't working priority/high

Comments

@AkihiroSuda
Member

Description

nerdctl start fails after restarting the host

Steps to reproduce the issue

  1. sudo nerdctl run -d --name foo busybox sleep infinity
  2. Reboot the host
$ sudo nerdctl start foo
FATA[0000] 1 errors:
failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running createRuntime hook #0: exit status 1, stdout: , stderr: time="2024-08-23T14:07:24+09:00" level=fatal msg="failed to lock state dir: unable to unmarshall lifecycle data: unexpected end of JSON input": unknown

Describe the results you received and expected

Received: error
Expected: starts

What version of nerdctl are you using?

$ sudo nerdctl version
Client:
 Version:	v2.0.0-rc.1
 OS/Arch:	linux/amd64
 Git commit:	778975fcaa57e365ee44dbad4a9d8d63180ae320
 buildctl:
  Version:	v0.0.0+unknown

Server:
 containerd:
  Version:	v2.0.0-rc.3
  GitCommit:	27de5fea738a38345aa1ac7569032261a6b1e562
 runc:
  Version:	1.2.0-rc.2
  GitCommit:	v1.2.0-rc.2-0-gf2d2ee5e

Are you using a variant of nerdctl? (e.g., Rancher Desktop)

None

Host information

$ sudo nerdctl info
Client:
 Namespace:	default
 Debug Mode:	false

Server:
 Server Version: v2.0.0-rc.3
 Storage Driver: overlayfs
 Logging Driver: json-file
  Cgroup Driver:  : systemd
  Cgroup Version: : 2
 Plugins:
  Log:     fluentd journald json-file syslog
  Storage: native overlayfs
 Security Options:
  apparmor
  seccomp
   Profile:	builtin
  cgroupns
 Kernel Version:   6.8.0-40-generic
 Operating System: Ubuntu 24.04 LTS
 OSType:           linux
 Architecture:     x86_64
 CPUs:             4
 Total Memory:     15.57GiB
 Name:             suda-ws01
 ID:               5097992c-2934-44bd-a1cb-5ac20cf2013f
@AkihiroSuda AkihiroSuda added the kind/unconfirmed-bug-claim, bug, and priority/high labels and removed the kind/unconfirmed-bug-claim label Aug 23, 2024
@apostasie
Contributor

apostasie commented Aug 23, 2024

I am getting different errors after reboot (lima VM):

$ sudo nerdctl start foo
FATA[0000] 1 errors:
failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: time="2024-08-22T23:29:54-07:00" level=error msg="failed re-acquiring name - see https://github.com/containerd/nerdctl/issues/2992" error="name \"foo\" is already used by ID \"f58b1233941da3daabdff9ce4befde5ce0fea208109dcb421b1985c311bb3109\""
time="2024-08-22T23:29:54-07:00" level=fatal msg="failed to call cni.Setup: plugin type=\"bridge\" failed (add): failed to allocate for range 0: 10.4.0.90 has been allocated to default-f58b1233941da3daabdff9ce4befde5ce0fea208109dcb421b1985c311bb3109, duplicate allocation is not allowed": unknown

Simply repeating the command then succeeds:

$ sudo nerdctl start foo
foo
Client:
 Namespace:	default
 Debug Mode:	false

Server:
 Server Version: v1.7.16
 Storage Driver: overlayfs
 Logging Driver: json-file
  Cgroup Driver:  : systemd
  Cgroup Version: : 2
 Plugins:
  Log:     fluentd journald json-file syslog
  Storage: native overlayfs stargz
 Security Options:
  apparmor
  seccomp
   Profile:	builtin
  cgroupns
 Kernel Version:   6.8.0-41-generic
 Operating System: Ubuntu 24.04 LTS
 OSType:           linux
 Architecture:     aarch64
 CPUs:             4
 Total Memory:     3.814GiB
 Name:             lima-default
 ID: 464a9e70-deeb-4d7d-851b-cef4efbd3742

@AkihiroSuda
Member Author

AkihiroSuda commented Aug 23, 2024

You may repro the issue with lima-vm/lima@fe7c317 (lima-vm/lima#2178)

@AkihiroSuda AkihiroSuda changed the title nerdctl start fails after restarting the host (msg="failed to lock state dir: unable to unmarshall lifecycle data: unexpected end of JSON input") [v2 regression] nerdctl start fails after restarting the host (msg="failed to lock state dir: unable to unmarshall lifecycle data: unexpected end of JSON input") Aug 23, 2024
@apostasie
Contributor

apostasie commented Aug 23, 2024

Will try tomorrow with updated lima + containerd v2rc.

Note:

The issue seems to have been present for some time.

This is what I get with nerdctl 1.7 + containerd 1.7 after a reboot:

sudo ./nerdctl start foo
FATA[0000] 1 errors:
failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: time="2024-08-22T23:40:46-07:00" level=fatal msg="failed to call cni.Setup: plugin type=\"bridge\" failed (add): failed to allocate for range 0: 10.4.0.92 has been allocated to default-ac9d84dbfc0b4873564385f72cde9efd872bc2608d185fb6414cd7f8aaee7de4, duplicate allocation is not allowed"
Failed to write to log, write /var/lib/nerdctl/1935db59/containers/default/ac9d84dbfc0b4873564385f72cde9efd872bc2608d185fb6414cd7f8aaee7de4/oci-hook.createRuntime.log: file already closed: unknown
sudo ./nerdctl version
Client:
 Version:	v1.7.6
 OS/Arch:	linux/arm64
 Git commit:	845e989f69d25b420ae325fedc8e70186243fd93
 buildctl:
  Version:	v0.12.5
  GitCommit:	bac3f2b673f3f9d33e79046008e7a38e856b3dc6

Server:
 containerd:
  Version:	v1.7.16
  GitCommit:	83031836b2cf55637d7abf847b17134c51b38e53
 runc:
  Version:	1.7.19
  GitCommit:	v1.1.13-0-g58aa920

Same thing here: the second time you invoke start, it works.

@apostasie
Contributor

The original error you saw ("failed to lock state dir") clearly points to:

return fmt.Errorf("failed to lock state dir: %w", err)

Although there is clearly more to it (prior versions seem to fail as well), this error suggests that something is going wrong with filesystem locks across a reboot.

@AkihiroSuda
Member Author

Seems solved in the current main branch
