CI: podman checkpoint container with --pre-checkpoint not working in container testing #24230

Open
Luap99 opened this issue Oct 10, 2024 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. stale-issue

Comments

@Luap99
Member

Luap99 commented Oct 10, 2024

With the latest image update (#24227) checkpoint is broken inside the container test:

→ Enter [It] podman checkpoint container with --pre-checkpoint - /var/tmp/go/src/github.com/containers/podman/test/e2e/checkpoint_test.go:969 (https://github.com/containers/podman/blob/ee70c495901ce4865b8a61290700c027eabd7937/test/e2e/checkpoint_test.go#L969) @ 10/10/24 14:04:37.825
           # podman [options] run -d --network podman5 quay.io/libpod/alpine:latest top
           6d1f1d2b3d02e8d920b33038860e7bfdf077712b3f99389a1866be88393ab22c
           # podman [options] container checkpoint -P 6d1f1d2b3d02e8d920b33038860e7bfdf077712b3f99389a1866be88393ab22c
           *** buffer overflow detected ***: terminated
           CRIU feature checking failed -52.  Please check CRIU logfile /tmp/CI_Nlm2/podman-e2e-190218032/subtest-2996264589/root/overlay-containers/6d1f1d2b3d02e8d920b33038860e7bfdf077712b3f99389a1866be88393ab22c/userdata/dump.log
           Error: `/usr/bin/crun checkpoint --image-path /tmp/CI_Nlm2/podman-e2e-190218032/subtest-2996264589/root/overlay-containers/6d1f1d2b3d02e8d920b33038860e7bfdf077712b3f99389a1866be88393ab22c/userdata/pre-checkpoint --work-path /tmp/CI_Nlm2/podman-e2e-190218032/subtest-2996264589/root/overlay-containers/6d1f1d2b3d02e8d920b33038860e7bfdf077712b3f99389a1866be88393ab22c/userdata --pre-dump 6d1f1d2b3d02e8d920b33038860e7bfdf077712b3f99389a1866be88393ab22c` failed: exit status 1

           [FAILED] Command failed with exit status 125. See above for error message.

Both "podman checkpoint container with --pre-checkpoint" and
"podman checkpoint container with --pre-checkpoint and export (migration)" fail the same way.

https://api.cirrus-ci.com/v1/artifact/task/5294903477927936/html/int-podman-fedora-40-root-container-sqlite.log.html

I don't have time to look into this, so I am going to skip these tests for now. I am filing this issue so we can track it.

@Luap99 Luap99 added the kind/bug Categorizes issue or PR as related to a bug. label Oct 10, 2024
Luap99 added a commit to Luap99/libpod that referenced this issue Oct 10, 2024
They no longer work in the latest image update, it is not clear why and
I do not have the time to debug that stuff. I opened containers#24230 to track it.

Signed-off-by: Paul Holzinger <[email protected]>
@edsantiago
Member

See containers/automation_images#387 (comment) , in particular, the criu 4.0 update:

        debian     prior-fedora  fedora     fedora-aws  rawhide
criu    3.17.1-3   3.19-2        4.0-1      3.19-4      4.0-1
                                 3.19-6 ⇑               3.19-7 ⇑

@Luap99 Luap99 changed the title podman checkpoint container with --pre-checkpoint CI: podman checkpoint container with --pre-checkpoint not working in container testing Oct 10, 2024
@Luap99
Member Author

Luap99 commented Oct 11, 2024

Reproducer:

$ sudo bin/podman run --rm --privileged --net=host --cgroupns=host -v /var/lib/containers -v $(pwd):/repo -w /repo -v /tmp:/tmp -it quay.io/libpod/fedora_podman:c20241010t105554z-f40f39d13 bash

[root@pholzing-fedora repo]# bin/podman run -d --name test quay.io/libpod/alpine:latest top
8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8
[root@pholzing-fedora repo]# bin/podman container checkpoint -P test
*** buffer overflow detected ***: terminated
2024-10-11T14:58:53.008984Z: CRIU feature checking failed -52.  Please check CRIU logfile /var/lib/containers/storage/overlay-containers/8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8/userdata/dump.log
Error: `/usr/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8/userdata --pre-dump 8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8` failed: exit status 1

The CRIU logfile was empty, so there is nothing useful to see in there.

Using a normal Fedora image as the base and then installing podman does not seem to reproduce the problem. I tried both criu-3.19-4 and criu-4.0-1, so there must be some magic in our special test image.

@adrianreber @rst0git Any ideas what could cause *** buffer overflow detected ***: terminated?
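For readers trying to interpret the "-52" in the crun message above, here is a minimal sketch that maps a negative return value to a Linux errno name. It assumes the number is a negated errno passed through from libcriu, which the thread does not confirm; the table uses the generic Linux errno numbering, and describeCriuCode is a hypothetical helper name, not part of crun or CRIU.

```go
package main

import "fmt"

// linuxErrnoNames lists a few Linux errno values (generic numbering).
// Assumption: these are the values a libcriu caller is most likely to see.
var linuxErrnoNames = map[int]string{
	22:  "EINVAL",
	52:  "EBADE",
	70:  "ECOMM",
	74:  "EBADMSG",
	111: "ECONNREFUSED",
}

// describeCriuCode turns a return value such as -52 into a readable name.
// Hypothetical helper for illustration only.
func describeCriuCode(code int) string {
	if code >= 0 {
		return "no error"
	}
	if name, ok := linuxErrnoNames[-code]; ok {
		return name
	}
	return fmt.Sprintf("unknown errno %d", -code)
}

func main() {
	// The value reported in "CRIU feature checking failed -52".
	fmt.Println(describeCriuCode(-52))
}
```

If that assumption holds, -52 would correspond to EBADE, which would suggest that the request reached CRIU and failed there rather than failing to connect. That reading is unverified here.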

@rst0git
Contributor

rst0git commented Oct 14, 2024

@Luap99 Would it be possible to confirm if the error appears with both runc and crun, or only with crun?

@Luap99
Member Author

Luap99 commented Oct 15, 2024

Well, this is fun: I am no longer able to reproduce this using the steps above, so I cannot tell.

@rst0git
Contributor

rst0git commented Oct 15, 2024

@Luap99 I was able to replicate the error locally with the following commands, and can confirm that it appears with both runc and crun:

cd ~/go/src/github.com/containers/podman
sudo podman run --rm --privileged --net=host --cgroupns=host -v /var/lib/containers -v $(pwd):/repo -w /repo -v /tmp:/tmp -it quay.io/libpod/fedora_podman:c20241010t105554z-f40f39d13 bash

# bin/podman run -d --name test quay.io/libpod/alpine:latest top
# bin/podman container checkpoint -P test

It looks like CRIU fails with the following error:

(00.124597) Putting tsock into pid 380229
(00.125016) Wait for parasite being daemonized...
(00.125031) Wait for ack 2 on daemon socket
(00.125271) Error (compel/src/lib/infect-rpc.c:44): Message reply from daemon is trimmed (12/0)
(00.125297) Error (compel/src/lib/infect.c:726): Can't switch parasite 380229 to daemon mode 0
(00.125323) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.125327) Error (compel/src/lib/ptrace.c:96): Can't poke 380229 @ 0x5573bb6df000 from 0x7ffef62e4418 sized 8
(00.125334) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.125337) Error (compel/src/lib/ptrace.c:100): Can't restore the original data with poke
(00.125341) Error (compel/src/lib/infect.c:637): Can't inject syscall blob (pid: 380229)
(00.125345) Warn  (criu/parasite-syscall.c:439): Can't cure failed infection
(00.125349) Error (criu/cr-dump.c:1493): Can't infect (pid: 380229) with parasite
(00.125426) Unfreezing tasks into 1
(00.125431) 	Unseizing 380229 into 1
(00.125438) Error (compel/src/lib/infect.c:418): Unable to detach from 380229: No such process
(00.125451) Writing image inventory (version 1)
(00.125719) Error (criu/cr-dump.c:1905): Pre-dumping FAILED.

dump.log
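When comparing dump.log files across runs, it can help to pull out only the Error lines. The following is a minimal sketch that assumes the line layout seen in the excerpt above, "(timestamp) Error (file:line): message". The type and function names are hypothetical, and other CRIU versions may format their logs differently.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// criuError holds one parsed "Error" line from a CRIU log.
type criuError struct {
	Timestamp string // seconds since CRIU start, e.g. "00.125271"
	Source    string // source location, e.g. "compel/src/lib/infect-rpc.c:44"
	Message   string // remainder of the line
}

// Matches lines such as:
// (00.125271) Error (compel/src/lib/infect-rpc.c:44): Message reply ...
var criuErrorRe = regexp.MustCompile(`^\((\d+\.\d+)\) Error \(([^)]+)\): (.*)$`)

// parseCriuErrors returns every Error line in log, in order of appearance.
// Warn and informational lines are skipped.
func parseCriuErrors(log string) []criuError {
	var out []criuError
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		m := criuErrorRe.FindStringSubmatch(strings.TrimSpace(sc.Text()))
		if m == nil {
			continue
		}
		out = append(out, criuError{Timestamp: m[1], Source: m[2], Message: m[3]})
	}
	return out
}

func main() {
	sample := `(00.125031) Wait for ack 2 on daemon socket
(00.125271) Error (compel/src/lib/infect-rpc.c:44): Message reply from daemon is trimmed (12/0)
(00.125345) Warn  (criu/parasite-syscall.c:439): Can't cure failed infection
(00.125719) Error (criu/cr-dump.c:1905): Pre-dumping FAILED.`
	for _, e := range parseCriuErrors(sample) {
		fmt.Printf("%s %s: %s\n", e.Timestamp, e.Source, e.Message)
	}
}
```

The first entry returned is usually the most interesting one. In the log above that is the trimmed reply from the parasite daemon, and the POKEDATA failures that follow are errors from the cleanup path.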

@rst0git
Contributor

rst0git commented Oct 15, 2024

I also noticed that the message *** buffer overflow detected *** appears with crun but not with runc:

crun:

DEBU[0000] the args to checkpoint: /usr/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata --pre-dump 3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9 
*** buffer overflow detected ***: terminated
2024-10-15T17:31:49.172489Z: CRIU feature checking failed -52.  Please check CRIU logfile /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata/dump.log
Error: `/usr/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata --pre-dump 3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9` failed: exit status 1
DEBU[0000] Shutting down engines                        
INFO[0000] Received shutdown.Stop(), terminating!        PID=37015

runc:

DEBU[0000] the args to checkpoint: /usr/bin/runc checkpoint --image-path /var/lib/containers/storage/overlay-containers/1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5/userdata --pre-dump 1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5 
ERRO[0000] CRIU feature check failed                    
Error: `/usr/bin/runc checkpoint --image-path /var/lib/containers/storage/overlay-containers/1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5/userdata --pre-dump 1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5` failed: exit status 1
DEBU[0000] Shutting down engines                        
INFO[0000] Received shutdown.Stop(), terminating!        PID=36877

@rst0git
Contributor

rst0git commented Oct 15, 2024

@adrianreber Do you have any ideas what may cause crun and runc to fail with CRIU feature checking failed?

It is worth noting that criu check --feature mem_dirty_track reports that mem_dirty_track is supported, and that the error disappears with the following change in Podman:

+++ b/utils/utils.go
@@ -39,7 +39,7 @@ func ExecCmdWithStdStreams(stdin io.Reader, stdout, stderr io.Writer, env []stri
        cmd.Stdin = stdin
        cmd.Stdout = stdout
        cmd.Stderr = stderr
-       cmd.Env = env
+       // cmd.Env = env
 
        err := cmd.Run()
        if err != nil {
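Some background on why this change matters: in Go's os/exec, a nil Cmd.Env makes the child inherit the parent's environment, while a non-nil slice replaces it entirely. Commenting out the assignment therefore switches the runtime from the explicit env slice to the inherited environment. One way to narrow down which variable is responsible is to diff the two. The sketch below uses hypothetical helper names (envMap, diffEnv) and made-up sample values; in practice the inputs would be the env slice passed to ExecCmdWithStdStreams and os.Environ().

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// envMap converts KEY=VALUE entries into a map. For duplicate keys the last
// entry wins, which matches how os/exec treats duplicates in Cmd.Env.
func envMap(env []string) map[string]string {
	m := make(map[string]string, len(env))
	for _, kv := range env {
		k, v, ok := strings.Cut(kv, "=")
		if !ok {
			continue // skip malformed entries without '='
		}
		m[k] = v
	}
	return m
}

// diffEnv reports keys that are missing on one side or whose values differ.
// The result is sorted so that output is stable between runs.
func diffEnv(explicit, inherited []string) []string {
	a, b := envMap(explicit), envMap(inherited)
	var out []string
	for k, v := range a {
		if bv, ok := b[k]; !ok {
			out = append(out, "only in explicit: "+k)
		} else if bv != v {
			out = append(out, "differs: "+k)
		}
	}
	for k := range b {
		if _, ok := a[k]; !ok {
			out = append(out, "only in inherited: "+k)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Sample values for illustration only; not taken from the failing CI run.
	explicit := []string{"PATH=/usr/bin", "XDG_RUNTIME_DIR=/run/user/0"}
	inherited := []string{"PATH=/usr/bin:/bin", "HOME=/root"}
	for _, line := range diffEnv(explicit, inherited) {
		fmt.Println(line)
	}
}
```

Re-adding the reported variables one at a time to the runtime's environment would then show which one triggers the feature-check failure, assuming the environment is in fact the cause.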


A friendly reminder that this issue had no activity for 30 days.
