[v2 regression] `nerdctl start` fails after restarting the host (`msg="failed to lock state dir: unable to unmarshall lifecycle data: unexpected end of JSON input"`) #3350

AkihiroSuda · 2024-08-23T06:18:25Z

Description

nerdctl start fails after restarting the host

Steps to reproduce the issue

sudo nerdctl run -d --name foo busybox sleep infinity
Reboot the host

$ sudo nerdctl start foo
FATA[0000] 1 errors:
failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running createRuntime hook #0: exit status 1, stdout: , stderr: time="2024-08-23T14:07:24+09:00" level=fatal msg="failed to lock state dir: unable to unmarshall lifecycle data: unexpected end of JSON input": unknown

Describe the results you received and expected

Received: error
Expected: starts

What version of nerdctl are you using?

$ sudo nerdctl version
Client:
 Version:	v2.0.0-rc.1
 OS/Arch:	linux/amd64
 Git commit:	778975fcaa57e365ee44dbad4a9d8d63180ae320
 buildctl:
  Version:	v0.0.0+unknown

Server:
 containerd:
  Version:	v2.0.0-rc.3
  GitCommit:	27de5fea738a38345aa1ac7569032261a6b1e562
 runc:
  Version:	1.2.0-rc.2
  GitCommit:	v1.2.0-rc.2-0-gf2d2ee5e

Are you using a variant of nerdctl? (e.g., Rancher Desktop)

None

Host information

$ sudo nerdctl info
Client:
 Namespace:	default
 Debug Mode:	false

Server:
 Server Version: v2.0.0-rc.3
 Storage Driver: overlayfs
 Logging Driver: json-file
  Cgroup Driver:  : systemd
  Cgroup Version: : 2
 Plugins:
  Log:     fluentd journald json-file syslog
  Storage: native overlayfs
 Security Options:
  apparmor
  seccomp
   Profile:	builtin
  cgroupns
 Kernel Version:   6.8.0-40-generic
 Operating System: Ubuntu 24.04 LTS
 OSType:           linux
 Architecture:     x86_64
 CPUs:             4
 Total Memory:     15.57GiB
 Name:             suda-ws01
 ID:               5097992c-2934-44bd-a1cb-5ac20cf2013f


apostasie · 2024-08-23T06:31:35Z

I am getting different errors after reboot (lima VM):

$ sudo nerdctl start foo
FATA[0000] 1 errors:
failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: time="2024-08-22T23:29:54-07:00" level=error msg="failed re-acquiring name - see https://github.com/containerd/nerdctl/issues/2992" error="name \"foo\" is already used by ID \"f58b1233941da3daabdff9ce4befde5ce0fea208109dcb421b1985c311bb3109\""
time="2024-08-22T23:29:54-07:00" level=fatal msg="failed to call cni.Setup: plugin type=\"bridge\" failed (add): failed to allocate for range 0: 10.4.0.90 has been allocated to default-f58b1233941da3daabdff9ce4befde5ce0fea208109dcb421b1985c311bb3109, duplicate allocation is not allowed": unknown

Simply repeating the command then succeeds:

$ sudo nerdctl start foo
foo

Client:
 Namespace:	default
 Debug Mode:	false

Server:
 Server Version: v1.7.16
 Storage Driver: overlayfs
 Logging Driver: json-file
  Cgroup Driver:  : systemd
  Cgroup Version: : 2
 Plugins:
  Log:     fluentd journald json-file syslog
  Storage: native overlayfs stargz
 Security Options:
  apparmor
  seccomp
   Profile:	builtin
  cgroupns
 Kernel Version:   6.8.0-41-generic
 Operating System: Ubuntu 24.04 LTS
 OSType:           linux
 Architecture:     aarch64
 CPUs:             4
 Total Memory:     3.814GiB
 Name:             lima-default
 ID: 464a9e70-deeb-4d7d-851b-cef4efbd3742

AkihiroSuda · 2024-08-23T06:38:43Z

You may repro the issue with lima-vm/lima@fe7c317 (lima-vm/lima#2178)

apostasie · 2024-08-23T06:45:41Z

Will try tomorrow with updated lima + containerd v2rc.

Note:

The issue seems to have been present for some time.

This is what I get with nerdctl 1.7 + containerd 1.7 on reboot:

sudo ./nerdctl start foo
FATA[0000] 1 errors:
failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: time="2024-08-22T23:40:46-07:00" level=fatal msg="failed to call cni.Setup: plugin type=\"bridge\" failed (add): failed to allocate for range 0: 10.4.0.92 has been allocated to default-ac9d84dbfc0b4873564385f72cde9efd872bc2608d185fb6414cd7f8aaee7de4, duplicate allocation is not allowed"
Failed to write to log, write /var/lib/nerdctl/1935db59/containers/default/ac9d84dbfc0b4873564385f72cde9efd872bc2608d185fb6414cd7f8aaee7de4/oci-hook.createRuntime.log: file already closed: unknown

sudo ./nerdctl version
Client:
 Version:	v1.7.6
 OS/Arch:	linux/arm64
 Git commit:	845e989f69d25b420ae325fedc8e70186243fd93
 buildctl:
  Version:	v0.12.5
  GitCommit:	bac3f2b673f3f9d33e79046008e7a38e856b3dc6

Server:
 containerd:
  Version:	v1.7.16
  GitCommit:	83031836b2cf55637d7abf847b17134c51b38e53
 runc:
  Version:	1.7.19
  GitCommit:	v1.1.13-0-g58aa920

Same thing here: the second time you invoke start, it works.

apostasie · 2024-08-23T06:54:25Z

The original error you saw ("failed to lock state dir") clearly points to:

nerdctl/pkg/ocihook/state/state.go

Line 59 in 607d560

return fmt.Errorf("failed to lock state dir: %w", err)

Although there is clearly more to it (prior versions seem to fail as well), this error suggests that something is going wrong with filesystem locks across a reboot.

AkihiroSuda · 2024-08-23T07:29:40Z

Seems solved in the current main branch

AkihiroSuda added the kind/unconfirmed-bug-claim, bug, and priority/high labels and removed the kind/unconfirmed-bug-claim label on Aug 23, 2024

AkihiroSuda closed this as completed Aug 23, 2024

This was referenced Aug 23, 2024

Container errors on restart #3352 (Closed)

Hardening lifecycle-state-store, name-store, and oci-hooks #3362 (Merged)
