Kernel oops page fault triggered by Docker in arc_prune #16324
I've seen this several times on several systems running ZFS 2.2.4 and Linux 6.8. Reverting to Linux 6.6 is stable. I have a trace which looks similar to yours.
|
I have the same problem on the latest Unraid 7.0.0-beta-1 prerelease.
|
I'm a mod at the Unraid forums and we have seen multiple users hit this issue with Docker on ZFS since kernel 6.8 (openzfs 2.2.4-1). There was also one report with kernel 6.7 during beta testing. The call traces all look very similar; some examples in case they help.
|
We've been hitting what looks to be this issue ever since we launched our new infrastructure, all running on the latest Ubuntu kernel and ZFS. Once we had migrated some hundreds of container workloads we started experiencing crashes. We've been very unfortunate in our crash dump collection, but we've ascertained that the crashes we are seeing are very similar to this; please see the LXCFS issue linked above. Symptoms as we see them:
ZFS version where we've seen this is
These workloads were all stable for years on older kernels. This is a real issue and I would not be surprised to learn that a lot of ZFS users out there are being affected by it now. It took us a long time to track down the source of our crashes, and I expect others may be in the same situation. I believe this issue warrants immediate attention, especially since upgrading to the latest mainline-ish kernel and ZFS does not seem to resolve it. |
Taking into account that this crash happens in shrinker-related code, I can make a wild guess that this issue should be provokable by something like the sketch below.
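For instance, something along these lines should force the kernel to walk all of its registered shrinkers, ZFS's included; just a sketch, run as root:

# Dropping caches makes the kernel invoke every registered shrinker,
# which is the code path these crashes are in.
sync
echo 2 > /proc/sys/vm/drop_caches   # 2 = reclaim slab objects (dentries, inodes) via shrinkers
echo 3 > /proc/sys/vm/drop_caches   # 3 = page cache + slab, the heavier variant
|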
I tried that and it did not crash my system or trigger the issue. AFAIK the ZFS ARC is separate. My guess would be it's hitting a race condition of sorts on heavy IO where it has to be evicting a lot out of ARC. |
I have a similar issue with kernel 6.8.12, docker 27.0.3, and zfs 2.2.4 (zfs-2.2.4-r0-gentoo) on Gentoo. My trace does not include arc_prune but zfs_prune.
|
@1JorgeB any chance switching storage drivers might be a reliable workaround until this gets resolved, versus using a btrfs image? |
same issue here, on 6.8.9. crashes after an AI training workload for a few hours. will revert to 6.6. EDIT: 6.6 is stable (am not using docker, just standard filesystem reading and writing). |
FYI OpenZFS has supported docker's overlay2 storage driver since 2.2. See moby/moby#46337 (comment) I gave up on the docker zfs storage driver some time ago, as it was pretty buggy. If you are starting docker with systemd you can modify the startup line to be:
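A sketch of one way to do it (the stock ExecStart line varies by distro, so copy your own and append the driver flag; paths here are just typical defaults):

# create a systemd drop-in for the docker unit
sudo systemctl edit docker
#   [Service]
#   ExecStart=
#   ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --storage-driver overlay2
sudo systemctl daemon-reload && sudo systemctl restart docker
docker info --format '{{.Driver}}'   # should now report overlay2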
Using overlay2 I don't have any issues loading |
Similar situation here, this time with 6.10.4 and zfs 2.2.5-1 on Debian. This occurred during a pull. I was forced to power cycle after this, which incidentally upgraded my kernel to 6.10.6, and the same pull succeeded fine afterward. But the ARC conditions would also be entirely different after a fresh boot, so I doubt the kernel upgrade mattered. Just some info. FWIW, my kernel is in lockdown mode due to secure boot.
|
Still reproducing in ZFS 2.2.6 and kernel 6.10.8-zen1-1-zen. |
I was playing with this one today and trying to reproduce it:
I've also enabled SLUB debugging (in
and KFENCE:
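For reference, a typical setup for that combination looks roughly like this (a sketch of common values, not necessarily the exact ones in play here):

# SLUB debugging on the kernel command line (requires CONFIG_SLUB_DEBUG=y):
#   slub_debug=FZPU    # F=sanity checks, Z=red zoning, P=poisoning, U=alloc/free user tracking
# KFENCE in the kernel config:
#   CONFIG_KFENCE=y
#   CONFIG_KFENCE_SAMPLE_INTERVAL=100   # in ms; lower samples more allocations, at higher cost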
With no results, unfortunately. I also tried limiting the amount of physical memory from 256GiB to 128G, 64G and 32G, with the same results: no crashes. Likely it's a tricky race condition. |
@mihalicyn maybe a silly question, but did you use Docker's ZFS storage driver? |
not a silly question at all ;-) When debugging stuff everything must be checked twice! Yeah, I do have ZFS storage driver enabled manually:
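Roughly like this (a sketch):

$ cat /etc/docker/daemon.json
{
  "storage-driver": "zfs"
}
$ docker info --format '{{.Driver}}'
zfs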
|
That is very odd, it crashes every single time for me. It's a guaranteed crash whenever I run that command. I'll see if I can make a VM that reproduces. In the meantime, I can provide any debug log asked. |
Please can you try to enable SLAB debugging with
That would be awesome too! |
Alright, I've been working on this for a bit and I haven't been able to reproduce on my laptop, nor in a VM on said laptop. I then did a quick sanity check and yup, still crashes first time on my desktop. So I decided to instrument the desktop, and then... nothing. It seems like
The trace I got this time, however, is different: it tried to execute an NX-protected page? The only thing of note I can think of that might be a contributing factor is that this is a Threadripper 1950X system, 16C/32T with 32 GB of RAM, so it's a NUMA system with a relatively high core count, which leaves a lot of room for a race condition. Maybe if others in here can share their specs we can correlate some things.
|
Additional information that I think could help narrow it down: I think it's possibly related to the creation and destruction of datasets and snapshots. I never see it die during the extraction, I see it die at the very end of it when it commits the Docker layer, whatever it's doing.
That smells like a use-after-free race condition triggered by changes in datasets and snapshots. Is there an existing stress test for that I could try? I'll see if I can write one later this weekend when I have more time; a rough sketch of what I have in mind is below.
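Something along these lines, against a throwaway pool (names are made up; this is only an untested sketch, not a known reproducer):

#!/usr/bin/env bash
# Hammer dataset/snapshot/clone create+destroy in parallel while dirtying data,
# to try to provoke a prune/eviction race. Run only against a scratch pool!
POOL=testpool   # assumption: a disposable pool mounted at the default /testpool
for w in $(seq 1 16); do
  (
    for i in $(seq 1 1000); do
      ds="$POOL/stress-$w-$i"
      zfs create "$ds"
      dd if=/dev/urandom of="/$ds/blob" bs=1M count=8 status=none
      zfs snapshot "$ds@s"
      zfs clone "$ds@s" "$POOL/clone-$w-$i"
      zfs destroy -r "$POOL/clone-$w-$i"
      zfs destroy -r "$ds"
    done
  ) &
done
wait
|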
Thanks for doing that!
yeah, it can make things way too slow. You can play with parameters
You can try to make
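I believe the knobs in question are along these lines (a sketch; see the kernel's KFENCE documentation for the exact parameter names):

# KFENCE samples allocations; a smaller interval catches more, at higher cost
cat /sys/module/kfence/parameters/sample_interval    # default is typically 100 (ms)
echo 10 > /sys/module/kfence/parameters/sample_interval
# or set it at boot: kfence.sample_interval=10
|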
I've observed the same error. Context:
|
I've reproduced on every system I have in production, which are all "above average" core count systems. AMD Threadripper 3990X (64c/128t), Ryzen 3900X, 5900X (12c/24t). I can't try slub_debug or KFENCE in production, but I could try within KVM on the same systems. |
We also observed this across many high core count systems, in our case dual Xeon Platinum systems with 56C/112T |
Hi @TheUbuntuGuy,
actually, you can enable KFENCE in production. It is designed to be enabled in production to debug issues like this one. |
I reverted the kernel on my production systems to v6.6 as they were unusable due to this bug. I meant that I couldn't test using those options due to it being a production system and I can't crash it. I will try in a VM using the same CPU layout when I get the chance, probably with some automation to try the crash over and over, since KFENCE is sampled and may not catch the issue quickly. |
...and here are the traces after running
Trace 1 decoded
Trace 2 decoded
|
I have a solution that works for me; it would be great if others could validate: vpsfreecz@74964ac (I'm not 100% sure it will compile outside the vpsfreecz kernel, but the problem is understood and I'll get to a PR for it once I hunt down some more issues; working on it furiously) |
@snajpa, your patch applied cleanly and compiled against 6.10, but it still crashes for me, however with a different trace.
|
@TheUbuntuGuy thank you! I'm actually hunting for multiple bugs now and there are places where I think it's possible that we're racing with (it's a work in progress, but it shouldn't eat data, it has known bugs in it still, but let's see if it changes your situation and how, if you agree) |
I wanted to keep the number of variables as small as possible, so I massaged the patch into the 2.2.6 release. It compiled fine, but it won't import any pools (I just get an I/O error). I will have to switch to the head of master and try again. |
I've applied both patches to master and it still crashes. I tried several times and it locks up in such a way that userspace doesn't get the kernel oops, so all I have is what came off the framebuffer, which is incomplete unfortunately. I did see some errors from Docker about corrupt layers before it crashed, so there is either corruption within ZFS, or memory from Docker happens to be getting overwritten. |
Well that sucks... yup it seems like there might be some znodes that aren't in perfect order stemming from a bug related to O_TMPFILE file creation - and also a trouble with lifetime of znodes of O_TMPFILES in combination with overlayfs->zfs->renameat2 RENAME_EXCHANGE and maybe also trouble with lifetime of RENAME_WHITEOUT znodes, probably also in combination with RENAME_EXCHANGE primarily. Now if you delete the affected dataset, the affected znodes should go away fine I think. I'm now scratching my head thinking about whether your workload w/ Docker can produce such broken znodes reliably, can it? If you roll those two patches back, how easy would it be for you to reproduce it? For me it takes ~10hrs of gentoo+fedora+debian images bootstrap before I hit any manifestation... so if you or anyone else has a faster reproducer that would be most awesome <3 Btw the current state of the second patch is 37ee1e9 - I think that has all the currently known problems fixed but as I said, 10hrs, oh man... :D I'll report if I broke something new I'm not expecting |
I'm using a 128-core Threadripper VM. I concurrently pull 15 different Docker images (from a local registry mirror), each with about 7 layers. They are different versions of the same private image so I can't share them unfortunately, but the contents don't really matter, just the number of layers, since that is what exercises the ZFS graphdriver. I can always produce a crash within 1 minute every time. I'll try your latest patch. |
FWIW my machine from my comment up-thread is an Epyc 7402P (24-core, single socket) with 128GB of RAM in a 16x8 full 8-channel config. Motherboard is Supermicro H12SSL-CT. |
I set up a serial console so I can get complete traces when it locks up hard. I got the following 2 traces with 37ee1e9: an oops and a panic. There may be more failure modes, but these were just the first runs.
OOPS 1
Panic 2
|
@TheUbuntuGuy thx! do you have block cloning enabled? can you try loading the zfs module with the parameter sketched below?
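Presumably the block-cloning toggle, something like this (just a sketch; double-check the parameter name against your module build):

# reload the module with block cloning disabled (assumes no pools are currently imported)
modprobe -r zfs
modprobe zfs zfs_bclone_enabled=0
# or persistently (file name is just an example):
echo "options zfs zfs_bclone_enabled=0" > /etc/modprobe.d/zfs-bclone.conf
|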
I just tried disabling block cloning and I get the same GPF. Since it's a VM, after each crash I am rolling it back to before Docker has any images, so every layer dataset is created new. However the parent dataset has existed since the OS was installed. |
@TheUbuntuGuy thank you sir! can I ask if you can use b4e4cbe as the base? Since it's in the write pipeline it makes me think I don't see it b/c I'm at a commit where the Direct IO is still disabled... otherwise I'll have to come up with new angles from which to look at this |
I ran b4e4cbe + vpsfreecz@74964ac + 37ee1e9 and I get a pretty bad panic:
Another run gives the same GPF:
|
that's interesting, it looks like another bug to solve, probably in the direction of dbuf dirty detection logic and... there be dragons... do you think you'd be able to figure out a reproducer you could share? I can currently run it only on a box of half that size but I'm very interested in any and every thing that can potentially give me a crash sooner than in hours :) |
If I understand you correctly, you have 64 cores? I will shrink the VM to that. It's using 8GB of RAM so that shouldn't be an issue. I'll see if I can find a set of public images that reproduce the issue. If you don't have a Docker registry you can use, I bet you can reproduce with loading images from tar archives. I'll check and get back to you. |
yup it's the first epyc, dual 32c (=128 threads, but not cores, only 64 of those); I don't work with docker a lot but I'm willing to do anything that will help me fix this, even setup a local registry :D |
but anyway the key will be the relatively small amount of RAM for such a number of cores; I was trying with 10G/16-thread VMs up until now. I don't have a more parallel reproducer; what you're doing sounds like something I want :D |
I think any relatively large container with multiple layers should reproduce fairly reliably. You can download a few versions of the one I posted originally, push them to a local registry, and then pull from it over LAN; it should trigger the bug pretty quickly. I tried to reproduce in a VM, but I've been hitting unrelated bugs on my test laptop that cause the NVMe to crash completely on the host. I still hit the bug within minutes on my Threadripper desktop (which I'd like to not crash too much). |
If using Docker Hub directly there are 2 main problems you will run into. The first is that the download may not be fast enough to trigger the issue. The second is that you will quickly reach the API limit for image pulls and get your IP blocked. This is why I would get a local registry spun up. It can either be on a different machine, the host, or within the VM itself.

I tried to just load images from both compressed and uncompressed tar files (to not need a registry), but I can't seem to reproduce easily. I suspect the network activity and decompression provide enough delay/jitter in the execution to exacerbate the issue. I have a registry cache set up, so that after the first image pull, they are served locally on my LAN.

I can reproduce with the following (random images I found with various sizes and layer counts):

IMAGES=(postfixadmin:3.3.13-apache postfixadmin:3.3.12-apache postfixadmin:3.3.11-apache \
postfixadmin:3.3.10-apache gradle:8.10-jdk23-jammy gradle:6-jdk11-focal gradle:jdk17-jammy \
gradle:jdk11-focal jetbrains/qodana-php:2024.1 jetbrains/qodana-php:2024.2 \
jetbrains/qodana-php:2023.2 jetbrains/qodana-php:2022.1-eap linuxserver/mysql-workbench:8.0.40 \
mysql-workbench:8.0.36-ls223 localstack/localstack-pro:latest-amd64 \
localstack/localstack-pro:latest-bigdata-amd64 localstack/localstack-pro:3.8.1)
for i in "${IMAGES[@]}"; do
docker pull "$i" &
done

If you set up a local registry, pull the images, tag them with the hostname of the registry, push them to that, then change the image names in the script to include the FQDN of your local registry; a sketch of that is below. Or you can set up a cache server like I have and it is all transparent. You will just have to configure the Docker daemon in the VM to use that cache server.
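Roughly like this, using the stock registry:2 image (a sketch; hostnames and ports are placeholders):

# disposable local registry plus re-tag/push of the test images
docker run -d --name registry -p 5000:5000 registry:2
for i in "${IMAGES[@]}"; do
  docker tag "$i" "localhost:5000/$i"
  docker push "localhost:5000/$i"
done
# on the test VM, pull <registry-host>:5000/<image> instead of the Docker Hub names
|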
yup thank you both just waiting for the last 16 local-ai versions to download, also learnt about the nice skopeo tool, let's get those stack traces flowing in fast -- as I said I don't work with docker a lot :D I'll go with the dumb local registry + tags for now, bash oneliner spaghetti ftw |
Here is another stacktrace from zfs 2.2.6 without any additional patches
Here is the command to decode the stacktrace using the script from kernel source:
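Roughly (a sketch; adjust paths to wherever your vmlinux and module objects live):

# decode_stacktrace.sh ships with the kernel source tree under scripts/;
# additional arguments (source base path, modules path) let it resolve module symbols too
./scripts/decode_stacktrace.sh ~/debug/vmlinux < oops.txt > oops.decoded.txt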
In ~/debug there is
For DKMS, set this in /etc/dkms/zfs.conf (tested on Debian):
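I believe the point of that override is to keep debug symbols, i.e. to stop dkms from stripping the built modules; something like the following (a sketch, double-check against the dkms man page):

# /etc/dkms/zfs.conf -- per-module dkms override
STRIP[0]="no"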
|
@TheUbuntuGuy control question if I may: without any patches, do you see a line similar to this one in your kernel log while hammering docker pull? ->
I'm currently at vpsfreecz@f43dcf7 for the zfs_prune stuff and vpsfreecz@4ae41fa (updated) for a lot of wild shit combined, but I'm pretty sure there's other stuff I haven't discovered yet that I've broken. Just to be on the safe side please try the second patch with a disposable pool first :) It's heavily WIP. But now I'm at a point where I don't see the panics you've reported that I've been able to reproduce easily with your help. Thanks! |
I will test your patches tonight. I haven't seen that message, but I wouldn't expect to since I'm not using overlayfs. I am using the native ZFS graphdriver, as I'd been using Docker on ZFS long before overlay mounts were supported. |
oh right lol :) I have a mix of it all on that one machine, this was from a run with upstream kernel without syslog namespace, it comes from docker in lxc which uses ovl |
OK, it doesn't pass further testing; I'll post once I have something which does. (So far I found one accidentally deleted line, updated above.) |
Tried vpsfreecz@f43dcf7 and vpsfreecz@4ae41fa and got some new traces; posting just in case they are different from your testing.
Traces
|
@TheUbuntuGuy thanks a lot, can you please try with |
Either I'm just hilariously wandering in the dark and creating more problems along the way than I'm solving, or we're really uncovering bug after bug in low-memory conditions... Update:
System information
I'm holding at 6.8.9 specifically to stay within officially supported kernel versions.
Describe the problem you're observing
Extracting large container images in Docker causes ZFS to trigger an unhandled page fault, and permanently locks up the filesystem until reboot. Sync will never complete, and normal shutdown also doesn't complete.
Describe how to reproduce the problem
Running this particular container reliably hangs ZFS on my system during extraction, using Docker's ZFS storage driver.
It gets stuck on a line such as this one and never completes; killing the Docker daemon makes it a zombie, and IO is completely hosed.
Include any warning/errors/backtraces from the system logs
The stack trace is always the same. Disk passes scrub with 0 errors after rebooting.