
CI: system tests: something is eating volumes #23913

Closed
edsantiago opened this issue Sep 10, 2024 · 11 comments

@edsantiago
Member

Seeing flakes in the |600| podman shell completion test when run in parallel:

$ podman __completeNoDesc something
...
# #| expected: '.*v-t361-kugfvnq4
(but no, it is not there!)

The trick is finding the culprit. There are very few run_podman volume rm commands in system tests, so it's easy to check them individually, and I don't see any way for any of them to be responsible. The journal shows nothing useful.
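(The audit described above amounts to grepping the test suite; a minimal sketch, assuming a podman checkout:)

# Find every place the system tests remove or prune volumes:
$ grep -rnE 'volume (rm|prune)' test/system/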

Seen in: sys(2) podman(2) rawhide(1) root(1) host(2) sqlite(2) debian-13(1) rootless(1)
@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Sep 10, 2024
@edsantiago edsantiago self-assigned this Sep 10, 2024
@edsantiago
Member Author

Seen on my laptop:

✗ |252| quadlet - volume [6509]
...
Sep 17 04:41:59 f.edsantiago.com systemd[2113]: Finished basic_t205-xzbhuwxt-volume.service.
   $ systemctl show --value --property=ActiveState basic_t205-xzbhuwxt-volume.service
   active
   
   [04:41:59.601800049] $ bin/podman volume ls
   [04:41:59.647905558] Error: no such volume

@mheon
Member

mheon commented Sep 17, 2024

That second one looks like a TOCTOU (time-of-check/time-of-use) race in the volume ls code, similar to the ones Paul fixed recently around containers and pods. Probably unrelated to the first.
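(The shape of the race: collect the volume names, then look each one up, with a concurrent rm landing between the two steps. Paul later reproduced it with exactly this two-terminal loop, quoted again in the fix commit further down:)

# Terminal 1: churn a volume
$ while :; do bin/podman volume create test && bin/podman volume rm test || break; done
# Terminal 2: list concurrently; before the fix this eventually fails with "no such volume"
$ while :; do bin/podman volume ls || break; done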


A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

Yesterday, f40 aarch64 root

@Luap99
Member

Luap99 commented Oct 29, 2024

Well, we have the cleanup tracer now, so we can see all podman commands that were executed, and I am still baffled. I ignored everything before the logged podman volume create v-t388-2tpdzsr0 and then filtered for volume or prune, but there is still no command logged that would explain it. There is a volume rm v-t407-e5qoxja3 logged, but that is clearly another volume, so at this point I can only suspect that some other command (one that really should not delete the volume) is deleting it.
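(The filtering described above amounts to something like this against the tracer log, assuming the create command shows up verbatim; the awk range is just one way to drop everything before it:)

$ awk '/volume create v-t388-2tpdzsr0/,0' podman-cleanup-tracer.log | grep -E 'volume|prune'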

@edsantiago
Member Author

Thanks for the confirmation. There are very few volume rm commands in system tests, and even fewer prune, and I have satisfied myself (99%) that none of those can possibly be causing this. What else could eat a volume?

@Luap99
Member

Luap99 commented Oct 29, 2024

I have no idea. There is podman rm -v/--volumes, but per the description that should only delete anonymous volumes (and volumes created by volume create are never anonymous). Also, it will only delete volumes attached to the removed container...

There is no easy explanation that lets me blame the tests, so I have to assume we might have a bug in podman itself where unrelated things get deleted.
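(That documented behavior is easy to sanity-check by hand; a minimal sketch, with arbitrary volume/container names and image:)

$ podman volume create namedvol
$ podman run --name c1 -v namedvol:/data -v /anon docker.io/library/alpine true
$ podman rm -v c1     # removes c1 and its anonymous volume; namedvol is untouched
$ podman volume ls    # namedvol is still listed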

Luap99 added a commit to Luap99/libpod that referenced this issue Oct 30, 2024
Debug for containers#23913. I thought: if we have no idea which process is nuking
the volume, then we need to figure that out. As there is no reproducer,
we can (ab)use the cleanup tracer. Simply trace all unlink syscalls to
see which process deletes our special named volume. Given that the volume
name is used as a path on the fs and is deleted on volume rm, hopefully
next time we will know exactly which process deleted it.

Signed-off-by: Paul Holzinger <[email protected]>
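(The idea, sketched as a standalone bpftrace invocation rather than the actual tracer code; the volume-name substring and output format are illustrative, and strcontains needs a reasonably recent bpftrace. A matching probe on sys_enter_unlink would complete the picture:)

$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_unlinkat
    /strcontains(str(args->pathname), "v-t388")/
    { printf("unlinkat by pid %d (%s): %s\n", pid, comm, str(args->pathname)); }'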
@Luap99
Member

Luap99 commented Nov 4, 2024

Ok this is not what I expected:
https://api.cirrus-ci.com/v1/artifact/task/6059057566449664/html/sys-podman-rawhide-rootless-host-sqlite.log.html
https://api.cirrus-ci.com/v1/artifact/task/6059057566449664/cleanup_tracer/podman-cleanup-tracer.log

No match for my magic "Special issue 23913 volume deleted by pid" string in the tracer log; what is going on here...
The tracer matches all unlink() and unlinkat() system calls AFAICT, and not a single process tried to remove that file, if the tracer works reliably (I cannot be 100% sure of that). So that means we lost the volume from the db only? And then leaked the path? I have created more questions than answers...
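(i.e., a plain search of the downloaded artifact comes up empty:)

$ curl -sO https://api.cirrus-ci.com/v1/artifact/task/6059057566449664/cleanup_tracer/podman-cleanup-tracer.log
$ grep -c 'Special issue 23913 volume deleted by pid' podman-cleanup-tracer.log   # prints 0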

@edsantiago
Member Author

Hm. If we assume that the volume is still there, and (please oh please) that there are no db integrity issues, could it be a bug in the query logic?

@Luap99
Member

Luap99 commented Nov 4, 2024

Maybe? So far this has only been seen with sqlite, but we have too few hits on that flake to exclude boltdb IMO.
Listing should be simple, though, and not somehow skip a volume.

Luap99 added a commit to Luap99/libpod that referenced this issue Nov 6, 2024
If volume ls was called while another volume was removed at just the
right time, it could fail with "no such volume", because we did not
ignore that error during listing. As we are listing things and the
volume no longer exists, the correct thing is to ignore the error and
continue, like we do with containers, pods, etc...

This was pretty easy to reproduce with these two commands running in
different terminals:
while :; do bin/podman volume create test && bin/podman volume rm test || break; done
while :; do bin/podman volume ls || break ; done

I have a slight feeling that this might solve containers#23913 but I am
not too sure, so I am not adding a Fixes here.

Signed-off-by: Paul Holzinger <[email protected]>
Luap99 added a commit to Luap99/libpod that referenced this issue Nov 7, 2024
(cherry picked from commit 9a0c0b2)
Signed-off-by: Paul Holzinger <[email protected]>
@Luap99
Member

Luap99 commented Dec 13, 2024

I haven't seen this once since #24479 so I assume that was it.

@Luap99 Luap99 closed this as completed Dec 13, 2024