Yet another missing-logs-and-events flake: journald? #24220

Open
edsantiago opened this issue Oct 9, 2024 · 6 comments
Labels
flakes Flakes from Continuous Integration

Comments

@edsantiago
Member

I've lost track of how many bugs I've opened for something-or-other like this. I'm going to lump together here the category of flakes seen in late 2024 where podman-remote logs is supposed to see something but doesn't.

Likely fix: change the tests so that, instead of podman wait; podman logs, they do for 5 retries { podman logs; grep for what we want; retry if not there }.
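
A minimal Python sketch of that retry idea, for illustration only. The real tests are not written in Python, and the helper names `wait_for_output` and `podman_logs`, the one-second delay, and the regex matching are all assumptions rather than anything taken from the podman test suite.

```python
import re
import subprocess
import time
from typing import Callable


def wait_for_output(fetch: Callable[[], str], pattern: str,
                    retries: int = 5, delay: float = 1.0) -> bool:
    """Call fetch() up to `retries` times; return True once `pattern` matches."""
    for attempt in range(retries):
        if re.search(pattern, fetch()):
            return True
        # Sleep between attempts, but not after the final one.
        if attempt < retries - 1:
            time.sleep(delay)
    return False


def podman_logs(container: str) -> str:
    """Hypothetical fetcher: return stdout of `podman logs <container>`."""
    result = subprocess.run(["podman", "logs", container],
                            capture_output=True, text=True)
    return result.stdout
```

A test would then call something like `wait_for_output(lambda: podman_logs("mycontainer"), "expected text")` and fail only if all five attempts come back without the expected line.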

| x | x | x | x | x | x |
| --- | --- | --- | --- | --- | --- |
| sys(1) | remote(2) | fedora-40-aarch64(1) | root(2) | host(2) | sqlite(2) |
| int(1) | | fedora-40(1) | | | |

@edsantiago added the flakes (Flakes from Continuous Integration) and remote (Problem is in podman-remote) labels Oct 9, 2024

github-actions bot commented Nov 9, 2024

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

Two on Thursday, but not remote, so I don't know if they're the same bug or something new. The total so far:

| x | x | x | x | x | x |
| --- | --- | --- | --- | --- | --- |
| int(3) | remote(3) | fedora-41(2) | root(5) | host(5) | sqlite(5) |
| sys(2) | podman(2) | fedora-40-aarch64(2) | | | |
| | | fedora-40(1) | | | |

@edsantiago removed the remote (Problem is in podman-remote) and stale-issue labels Nov 11, 2024
@edsantiago changed the title from "Yet another podman-remote logs missing output flake" to "Yet another podman-remote(??) logs missing output flake" Nov 11, 2024
@edsantiago
Member Author

This one is blowing up, and I'm tentatively blaming it on the recent VM update. Issue title changed accordingly.

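| x | x | x | x | x | x |
| --- | --- | --- | --- | --- | --- |
| int(9) | podman(10) | rawhide(6) | root(14) | host(15) | sqlite(15) |
| sys(6) | remote(5) | fedora-41(4) | rootless(1) | | |
| | | fedora-40-aarch64(2) | | | |
| | | fedora-41-aarch64(2) | | | |
| | | fedora-40(1) | | | |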

@edsantiago changed the title from "Yet another podman-remote(??) logs missing output flake" to "Yet another missing-logs flake: journald?" Nov 12, 2024
@edsantiago changed the title from "Yet another missing-logs flake: journald?" to "Yet another missing-logs-and-events flake: journald?" Nov 13, 2024
@edsantiago
Member Author

cirrus-vm-get-versions, trimmed to remove packages that can't possibly (?) be causing this. New:

| | debian | prior-fedora | fedora | fedora-aws | rawhide |
| --- | --- | --- | --- | --- | --- |
| base | 13.5 | Generic | Generic-41-1.4 | ? | 42-0 |
| kernel | 6.11.6-1 | 6.8.5-301 | 6.11.6-300 | 6.11.6-300 | 6.12.0-0.rc6.20241105git2e1b3cc9d7f7.52 |
| conmon | 2.1.12-3 | 2.1.12-2 | 2.1.12-3 | 2.1.12-3 | 2.1.12-3 |
| containers-common | ? | 0.60.4-2 | 0.60.4-4 | 0.60.4-4 | 0.60.4-5 |
| crun | 1.18.2-1 | 1.17-1 | 1.18.2-1 | 1.18.1-1 | 1.18.2-1 |
| golang | 2:1.23~2 | 1.22.7-1 | 1.23.2-2 | 1.23.2-2 | 1.23.2-2 |
| systemd | 257~rc1-3 | 255.13-1 | 256.7-1 | 256.7-1 | 256.7-1 |

...and old (c20241016t144444z-f40f39d13, the VMs that were running fine):

| | debian | prior-fedora | fedora | fedora-aws | rawhide |
| --- | --- | --- | --- | --- | --- |
| base | 13.5 | 39-1.5 | Generic | ? | 42-0 |
| kernel | 6.11.2-1 | 6.5.6-300 | 6.8.5-301 | 6.8.5-301 | 6.8.5-301 |
| conmon | 2.1.12-1 | 2.1.12-2 | 2.1.12-2 | 2.1.12-2 | 2.1.12-3 |
| containers-common | ? | 1-99 | 0.60.4-1 | 0.60.4-1 | 0.60.4-1 |
| crun | 1.17-1 | 1.17-1 | 1.17-1 | 1.17-1 | 1.17-1 |
| golang | 2:1.23~2 | 1.22.8-1 | 1.22.7-1 | 1.22.7-1 | 1.23.2-2 |
| systemd | 256.7-1 | 254.18-1 | 255.13-1 | 255.13-1 | 256.7-1 |

(I can't use the script's --baseline helper because these are different fedoræ)
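
As a stand-in for that comparison, here is a rough sketch that diffs the old and new package versions column by column. It is a plain dictionary comparison, not how cirrus-vm-get-versions works; the `OLD`/`NEW` dictionaries and the `changed_packages` helper are made up for illustration, and only the fedora and rawhide columns from the two listings above are copied in.

```python
# Package versions copied from the fedora and rawhide columns of the
# "old" (c20241016t144444z-f40f39d13) and "new" listings.
OLD = {
    "fedora": {"kernel": "6.8.5-301", "conmon": "2.1.12-2",
               "containers-common": "0.60.4-1", "crun": "1.17-1",
               "golang": "1.22.7-1", "systemd": "255.13-1"},
    "rawhide": {"kernel": "6.8.5-301", "conmon": "2.1.12-3",
                "containers-common": "0.60.4-1", "crun": "1.17-1",
                "golang": "1.23.2-2", "systemd": "256.7-1"},
}
NEW = {
    "fedora": {"kernel": "6.11.6-300", "conmon": "2.1.12-3",
               "containers-common": "0.60.4-4", "crun": "1.18.2-1",
               "golang": "1.23.2-2", "systemd": "256.7-1"},
    "rawhide": {"kernel": "6.12.0-0.rc6.20241105git2e1b3cc9d7f7.52",
                "conmon": "2.1.12-3", "containers-common": "0.60.4-5",
                "crun": "1.18.2-1", "golang": "1.23.2-2",
                "systemd": "256.7-1"},
}


def changed_packages(old: dict, new: dict) -> dict:
    """Return {package: (old_version, new_version)} for versions that differ."""
    return {pkg: (old[pkg], new[pkg])
            for pkg in old
            if pkg in new and old[pkg] != new[pkg]}


if __name__ == "__main__":
    for image in OLD:
        print(image)
        for pkg, (before, after) in changed_packages(OLD[image], NEW[image]).items():
            print(f"  {pkg}: {before} -> {after}")
```

Running it lists, per image, only the packages whose version strings differ between the two VM builds.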

Interesting observation: when we see this fail in system tests, it's often in the serial tests (the first pass). That points against it being a high-system-load issue.

So far, no luck reproducing on 1mt.

edsantiago added a commit to edsantiago/libpod that referenced this issue Nov 18, 2024
Get new systemd on rawhide, see what happens with containers#24220

Built in : containers/automation_images#394

Signed-off-by: Ed Santiago <[email protected]>
edsantiago added a commit to edsantiago/libpod that referenced this issue Nov 18, 2024
Get new systemd-257~rc1 on rawhide, see what happens with containers#24220

Built in : containers/automation_images#394

Signed-off-by: Ed Santiago <[email protected]>
@edsantiago
Member Author

I guess this isn't too shocking, but flake is now showing up in APIv2 tests:

         not ok 1195 [27-containersEvents] GET libpod/events?stream=false&since=(T) : select(.status | contains("died")).Action
         #  expected: died
         #    actual: 
         not ok 1196 [27-containersEvents] GET libpod/events?stream=false&since=(T) : select(.status | contains("died")).Actor.Attributes.containerExitCode
         #  expected: 1
         #    actual: 
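
To illustrate what those two checks look for, here is a small Python equivalent of the `select(.status | contains("died"))` filter. The sample events are hypothetical and only carry the fields named in the test output (`status`, `Action`, `Actor.Attributes.containerExitCode`); whether the real exit code is a string or a number is an assumption. An empty result corresponds to the blank "actual" above: no "died" event was in the response.

```python
def died_fields(events):
    """Keep events whose status contains "died"; return (Action, exit code) pairs."""
    matches = [e for e in events if "died" in e.get("status", "")]
    return [(e["Action"], e["Actor"]["Attributes"]["containerExitCode"])
            for e in matches]


# Hypothetical responses, not captured from a real run.
healthy = [
    {"status": "start", "Action": "start", "Actor": {"Attributes": {}}},
    {"status": "died", "Action": "died",
     "Actor": {"Attributes": {"containerExitCode": "1"}}},
]
flaky = healthy[:1]  # same response, but the "died" event is missing

if __name__ == "__main__":
    print(died_fields(healthy))
    print(died_fields(flaky))
```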

@edsantiago
Member Author

edsantiago commented Nov 21, 2024

This is now our number one flake. I regret to say that systemd-257~rc1 (bumped in #24596) does not solve the problem: we're still seeing the flake in rawhide.

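| x | x | x | x | x | x |
| --- | --- | --- | --- | --- | --- |
| int(15) | podman(14) | fedora-41(11) | root(24) | host(25) | sqlite(25) |
| sys(10) | remote(11) | rawhide(8) | fedora-41(2) | (root)(2) | |
| APIv2(2) | test(2) | fedora-41-aarch64(3) | rootless(1) | | |
| | | on(2) | | | |
| | | fedora-40-aarch64(2) | | | |
| | | fedora-40(1) | | | |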

EDIT: The weird "on" is because cirrus-flake-xref assumes a test name format (int/sys this that whatever) that APIv2 tests do not conform to. Sorry.
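
A rough sketch of how a purely positional split produces that stray column entry. This is not the cirrus-flake-xref implementation, which I have not reproduced here; `tally_columns` is a made-up helper, and both test names are inferred from the table columns rather than copied from CI.

```python
from collections import Counter


def tally_columns(names, ncols=6):
    """Split each test name on whitespace and count tokens per column position."""
    cols = [Counter() for _ in range(ncols)]
    for name in names:
        for i, token in enumerate(name.split()[:ncols]):
            cols[i][token] += 1
    return cols


# Inferred, illustrative names: one in the expected six-field format,
# one in an APIv2-style format that does not fit it.
names = [
    "int podman fedora-41 root host sqlite",
    "APIv2 test on fedora-41 (root)",
]

if __name__ == "__main__":
    for position, counts in enumerate(tally_columns(names)):
        print(position, dict(counts))
```

With the second name, "on" lands in the distro column, the distro lands in the root/rootless column, and "(root)" lands in the host column, which matches the odd entries in the table.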
