Kernel Panic or Lockup on DS3615xs image #21
Comments
Like @WiteWulf, I have the same issue on an ESXi VM. When I run even a simple nginx container... randomly...
the CPU jumps to 100% |
There have been a couple of further reports from users with Celeron CPUs (J1900), so this is not limited to the Xeon architecture. All are running the DS3615xs image. |
Hello, I tried with DSM 6.2.4: DSM hangs once docker starts... I must reset the VM.
Edit: Actually it is not only docker. I installed Moments on 6.2.4 and tried to import 20 files in a row... the system froze:
@WiteWulf, could you try? Moments seems to continue importing images, but very slowly, and DSM is unresponsive. Edit 2: now as soon as the Moments app starts, it hangs DSM... I can't stop it from starting. |
After uploading a batch of photos with face detection enabled, the same happens on baremetal running DS3615xs 7.0.1 with an AMD CPU: I get consistent hangs/freezes. Unfortunately I have no serial console to capture the soft lockup output. I will try increasing the watchdog threshold to 60 and see if this helps: echo 60 > /proc/sys/kernel/watchdog_thresh and sysctl -a | grep -i watch |
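A minimal sketch of the watchdog check described above (standard Linux sysctl paths; run as root from the DSM shell):
# show the current soft-lockup detector threshold (seconds)
cat /proc/sys/kernel/watchdog_thresh
# raise it to the 60-second maximum so short stalls stop tripping the detector
echo 60 > /proc/sys/kernel/watchdog_thresh
# list all watchdog-related sysctls to confirm the change took effect
sysctl -a | grep -i watchdog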
Can you confirm:
...and can you post the serial console output when you get a hang/freeze? Hang/freezes have only been observed so far on virtualisation platforms (such as ESXi and Proxmox), with kernel panics on baremetal. |
I'm on 7.0.1-RC1, so I don't have Moments, but I installed Photos (I wasn't using it previously) and uploaded ~250 images to it. CPU usage barely got above 20% (it looks like it's single-threaded), but disk writes were much higher than when I've seen kernel panics running docker workloads (which makes me think this isn't related to high disk throughput). The server did not kernel panic or lock up. |
@WiteWulf I confirm Synology Photos does not suffer from the same issue... My DSM 7.0.1 RC does not have the issue either. |
@OrpheeGT You get a soft lockup on fileindexd, which is called at various times for indexing by various applications:
[ 1230.717458] BUG: soft lockup - CPU#1 stuck for 41s! [fileindexd:22537]
Increase the watchdog threshold to 60 (the maximum, in seconds). |
I noticed it was actually fileindexd that was named in the kernel panic output you posted, so I pointed Universal Search at a ~2TB music folder on my NAS to see if that crashes. And...BOOM! synoelasticd (another database process) kernel panics:
|
@WiteWulf Yes, the last thing I saw while running htop was synoelasticd as well |
So far I've seen kernel panics mostly related to:
Elasticsearch seems to be the back-end used by the Synology Universal Indexer, and the others have been installed by users in docker containers. |
So the title can be renamed; it is not only docker... @WiteWulf, you made it crash on DSM 7? |
Yes, Universal Search crashed on DSM 7 while indexing a large folder (many large files). I've changed the title to reflect that it's not just docker causing issues. |
Synology Photos must work differently, because I added around 20 images in a row to it like I did with Moments, but had no issue with it... |
I may have done it wrong, but I removed the register_pmu_shim line as @ttg-public requested for @WiteWulf, then tried to import 20 files in a row with the Synology Moments app:
So it still shows the same behaviour for me... |
I confirm that Universal Search on the DS3615 image freezes the system on both physical and virtual machines, even with the latest code. It looks like the redpill module is crashing, as I get ATA disk and btrfs errors. On DS918+, it does not have the same effect. |
@labrouss can you confirm that your baremetal/physical system is freezing and not kernel panicking? All other reports I've had indicate kernel panic followed by reboot on baremetal, and freeze/lockup on virtual. It's important to be clear what you're experiencing here. |
As per @ttg-public's suggestion on the forum, "Can you try deleting the line with "register_pmu_shim" from redpill_main.c (in init_(void) function) and rebuilding the kernel module?" I rebuilt the loader without the PMU shim and am still seeing the kernel panics when loading my influxdb docker container.
|
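For anyone repeating this experiment, a rough sketch of the source edit being described (the file name comes from the quoted suggestion; the rebuild step is only outlined, as it depends on your redpill-lkm checkout and toolchain):
# drop the PMU shim registration line from the loader source, keeping a backup
sed -i.bak '/register_pmu_shim/d' redpill_main.c
# confirm the line is gone (should print 0)
grep -c register_pmu_shim redpill_main.c
# then rebuild the kernel module as usual, re-inject redpill.ko into the boot
# image, and reboot to see whether the panics persist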
For the record, I am not experiencing this issue. Using Universal Search and having it index the entire home folder does not result in any issues. I don't have a lot of data in the home folder, as this is a test VM, so I'm not sure whether your problem only occurs with higher I/O load and lots of data to index? I am running DSM 7 41222 RedPill 3615xs on ESXi v7. |
Could you just try to install docker and an influxdb container on it? |
I found that simply indexing a few photos didn't cause problems, but pointing it at a very large music folder (2TB, 160k files) caused a kernel panic after a few minutes. |
Both baremetal and virtual are not panicking or rebooting, but are having soft lockups. The issue with Universal Search is consistent and has been verified numerous times. I have a VM running an old image of the DS3615 with RedPill v0.5-git-23578eb; Universal Search does not have the same effect there and works as expected. You can find your version of the redpill module by running: dmesg | grep RedPill. You can also do a "git checkout 23578eb", recompile, inject the module into your rd.gz and check. |
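A sketch of the version check and rollback described above (the dmesg command is as posted; the checkout assumes a local clone of the redpill-lkm source, and the recompile/injection into rd.gz is only outlined):
# report which RedPill commit the currently loaded module was built from
dmesg | grep -i redpill
# roll the module source back to the known-good commit
git checkout 23578eb
# recompile the module with your usual build, inject it into rd.gz on the
# loader image, then reboot and re-run the Universal Search test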
Interesting that that commit is related to the PMU shim, notably bug fixes and enabling it by default. So, to review:
Thanks for the info! |
The latest redpill code (3474d9b) actually seems more prone to kernel panicking than previous versions. My influxdb container was guaranteed to crash the system every time I started it, but the others I had were typically stable and non-problematic. I'm now seeing immediate kernel panics when starting a mysql container that previously didn't cause any problems. I've gone back to a slightly older build (021ed51), and mysql (operating as a database backend for librenms) starts and runs without problems. (FYI, I'm not using 021ed51 for any reason other than that I already had a USB stick with it on. I don't perceive it to be more stable than any other commit.) |
As an experiment, I took the HDDs out of my Gen8 and put an old spare drive in, booted it from the stick with redpill 021ed51 on it and did a fresh install. I installed a load of docker containers (mysql, mariadb, influxdb and nginx) and pretty soon mysqld kernel panic'd:
I basically wanted to establish whether or not the crashes were related to data or configuration carried over from my previous 6.2.3 install, and it doesn't look like they are. |
Quick follow-up on the above: I had to push the system to crash. Influxdb didn't do it straight away as it does on my "live" server, nor did Universal Search indexing a few GBs of data. Once I booted back into my live system it crashed again while starting up docker, and came back up scrubbing the disks. While the disks were scrubbing, any attempt to start docker would cause another kernel panic, but after leaving it to finish scrubbing overnight I was able to start everything except the influxdb container without trouble. This suggests to me that the problem is related to load somehow. It's nothing as obvious as CPU load or disk throughput, as I can quite happily stream 4K video and transcode 1080p content with Plex on this system. It's something harder to pin down. |
I see you are running 7.0.1-42218; can you try with a 7.0 loader instead? Note that, as I found when testing, it's not possible to go back to an earlier version in place, so you will have to reinstall. |
gen8 8 x Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz |
Via Google translate, for the benefit of the non-Chinese speaking:
|
@WiteWulf |
It's been covered in depth on the forum, both in the main Redpill thread and in its own dedicated thread. Both are easy to find with the search function. ThorGroup tend to reply to the forum threads at most once a week; there was a big update from them overnight where they addressed this issue many times. |
Well, if there is any solution, please @ me. Thank you. |
Sorry, no, that’s not how this works 😀 If you want to keep up to date, either “follow” the repo or this issue here on GitHub, or the thread on the forums. Use the tools available to you… |
We don't want to say that but... we THINK we nailed it down ;)

It looks like we were accidentally touching some of the registers which weren't preserved by the caller. This is an exception with a few syscalls in Linux (the full stack frame is not preserved for performance reasons). After numerous articles and plenty of research we concluded it's not worth fighting for literally 1-3 CPU cycles per binary and going ASM-crazy (in comparison, access to MEMORY is one to a few hundred times slower, so it only matters at datacenter scale with millions of executions), and added an unconditional jump to the

The fact it worked on 918+ seems to be sheer luck: the GCC used to compile the kernel on 918+ didn't touch these registers (or didn't touch them enough to cause a problem). These "random" soft lockups were caused by the interaction of the swapper process and context switching. That's why docker + a database was a PERFECT triggering scenario, as it stresses memory, uses swap extensively, executes many processes and switches contexts like crazy (which is, as a side note, why running a database in a container is really hard to pull off correctly in any performance-oriented application).

There are two tests we were running which consistently crashed the 3615xs instance on Proxmox:

# Test 1 - installing and removing a mysql container; usually crashed a 512MB instance within the first try/seconds
while true; do
  sudo docker rm -f test-mysql ;
  sudo docker image prune --all -f ;
  sudo docker run --name test-mysql -e MYSQL_ROOT_PASSWORD=test -d mysql:latest &&
  echo "OK"
done
# Test 2 - on instances with more RAM stress-testing the server usually triggered the crash in under a minute
docker run --name test-mysql -e MYSQL_ROOT_PASSWORD=test -d mysql:latest
docker exec -it test-mysql bash
# execute mysql -u root -p => then execute any SQL, like SHOW DATABASES;
# for some reason, running mysqlslap right away ends up with denied connections?
mysqlslap --user=root -p --auto-generate-sql --number-char-cols=100 --number-int-cols=100 --detach=10 --concurrency=200 --iterations=5000

That being said, once we debugged the issue the fix wasn't that difficult - proper stack protection and an indirect jump seem to fix the problem. This is really the commit which introduced the mechanism: 122470d (it's slightly noisy as we moved one file; the important change is the one with the comment to the

@WiteWulf: your reports were very useful. Can you poke around and see if the new version fixes the problem? We tested it in multiple scenarios and weren't able to crash it but....

p.s. If anyone wants to get nerdy about performance impacts, CloudFlare has a nice blog post about a similar problem: https://blog.cloudflare.com/branch-predictor/ |
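To see the triggering conditions described above (heavy swapping and rapid context switching) while either test runs, the standard counters can be watched from a second shell; a minimal sketch, assuming vmstat is available on the box (the /proc reads work on any Linux kernel):
# swap in/out, context switches per second and memory pressure, sampled every second
vmstat 1
# or read the raw counters directly
grep -E 'pswpin|pswpout' /proc/vmstat
grep '^ctxt' /proc/stat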
Yes! At least now Universal Search (synoelasticd) doesn't crash the system! This was the most consistent crash and I was able to replicate it on both physical and virtual machines, which in my case varied from AMD to Intel CPUs, old and newer CPUs, etc. I will continue testing. |
To be disregarded, did not git pull... my bad. |
Well, it's definitely an improvement, but not fixed. This is just a "feeling", but I can get more containers running than I could on 3474d9b before it crashes. I had all my docker containers running and was importing a load of data into influxdb when it crashed:
Notably, it wasn't the influxdb or containerd-shim processes that it flagged this time, though. It was the python script I was running to gather the data and import it into influxdb. Plex Media Server next... Nope:
After the reboot I just left Plex running in the background and it went again without me even pushing it with any updates/scans. |
I left Test #1 running and had no crash until I reached the pull limit! That's definitely an improvement. I still need to test it on a slower CPU, though. |
@OrpheeGT feeling brave enough to upgrade to 7.0.1? :) |
I just recompiled it. After several hours of testing, it didn't crash. |
@WiteWulf:
docker run --name influx-test -d -p 8086:8086 -v $PWD:/var/lib/influxdb influxdb:1.8
docker exec -it influx-test sh
# inside of the container:
wget https://golang.org/dl/go1.17.1.linux-amd64.tar.gz &&
tar -C /usr/local -xzf go1.17.1.linux-amd64.tar.gz &&
rm go1* &&
export PATH=$PATH:/usr/local/go/bin &&
go get -v 'github.com/influxdata/influx-stress/cmd/...'
/root/go/bin/influx-stress insert -f --stats
Indeed this causes a lot of CPU usage and plenty of swapping, along with sparse |
We found one more place which can technically cause an infinite loop under very specific circumstances, but that shouldn't cause a hard lockup crash even if it occurs (see 0065c89).

@OrpheeGT: your results look normal. The CPU usage peaks are due to Moments processing. The reported locks/unlocks are actually our debug messages to diagnose whether something crashes within the lock - if you see |
I did not hide anything, actually. Most of the time I watch the live telnet serial console instead of /var/log/messages |
@ttg-public on the forums you posted It's significant that I never had this on 6.2.3 with Jun's bootloader, but I have seen it on 6.2.4, 7.0 and 7.0.1 since moving to redpill. I appreciate that it may be an issue introduced by Synology between 6.2.3 and 6.2.4, but imho it's more likely an interaction between redpill, the kernel and the NMI watchdog. One way to test this would be to have redpill load 6.2.3 and see if the problem persists. I don't know how much work would be involved in creating a 6.2.3 build target, but it would potentially isolate the problem to either redpill or a change introduced by Synology in 6.2.4. FWIW, I've been running this system with nmi_watchdog=0 since Saturday night with nothing obviously going wrong. |
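For anyone who wants to try the same workaround, a minimal sketch (the /proc paths are standard Linux; where the kernel command line is edited depends on your loader/grub setup):
# check whether the NMI watchdog is currently active (1 = enabled, 0 = disabled)
cat /proc/sys/kernel/nmi_watchdog
# the workaround is to append nmi_watchdog=0 to the kernel command line in the
# loader's grub configuration; after rebooting, confirm it was picked up:
grep -o 'nmi_watchdog=0' /proc/cmdline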
May I ask, do you know why it does not happen with Jun's loader (soft and hard lockups)? Were you able to catch it? |
This looks interesting: 19f745a |
I just built a new DS3615xs 7.0.1-42218 image using the latest code commit (0df1ec8) and booted with nmi_watchdog running. The system survived the docker containers starting, but as soon as Plex started updating stuff it kernel panicked. I've rebooted with 'nmi_watchdog=0' set and it's stable again. |
Hey folks, big development over on the forums: tl;dr disabling my NC360T PCIe NIC and instead adding the tg3 extension and using the onboard NIC massively improves system stability. I never thought to ask others suffering with kernel panics if they were also using a PCIe NIC. I'll gather info and report back. |
Hmmm, not looking so good in the cold light of day :) Multiple reboots overnight when Plex started doing library updates and media analysis. I've asked Kouill to share their boot image with me so I can test on my hardware. I'll report back |
I've found a strange behaviour with the NMI watchdog while rebooting and testing. I currently have nmi_watchdog=0 in my grub.conf to keep the system stable; as demonstrated previously, the system does not kernel panic under load with this. I wanted to test a new bootloader and accidentally left nmi_watchdog=0 in the json file, so when the system booted it wouldn't crash. I spotted this and did 'echo 1 > /proc/sys/kernel/nmi_watchdog'. The kernel confirmed this on the console with 'NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.', but subsequently running the influx-test does not crash it. This is reproducible: booting with the NMI watchdog enabled results in the previously demonstrated kernel panics, but booting with the NMI watchdog disabled and then enabling it once the system has finished booting leaves it stable. |
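A sketch of the reproduction described above (assumes the system was booted with nmi_watchdog=0 and that the influx-stress test posted earlier in the thread is used as the load generator):
# re-enable the NMI watchdog at runtime, after boot has completed
echo 1 > /proc/sys/kernel/nmi_watchdog
# the kernel logs 'NMI watchdog: enabled on all CPUs, ...' - check the console or:
dmesg | tail -n 5
# now run the influx-stress load test; in this report the system stays stable,
# whereas booting with the watchdog enabled from the start panics under the same load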
Has the problem been solved? Can I upgrade? |
I would say this was a workaround rather than a solution. As with all things related to redpill, it comes with a risk of data loss: don't use it with data you don't have backed up elsewhere or don't mind losing. |
A number of users on the forum are reporting kernel panics on baremetal installs, or lockups of guests on virtualisation platforms, when running the DS3615xs image specifically. This is typically precipitated by running docker in general, or certain docker images, but may also have been caused by high IO load in some circumstances. A common feature is use of databases (notably influxdb, mariadb, mysql and elasticsearch) but also nginx and jdownloader2.
This has been observed on baremetal HP Gen7 and Gen8 servers, and on Proxmox and ESXi, with a variety of Xeon CPUs (E3-1265L V2, E3-1270 V2, E3-1241 V3, E3-1220L V2 and E3-1265L V4), as well as Celeron and AMD.
Most users are on DSM 7.0.1-RC1, but I also observed this behaviour on DSM 6.2.4.
(edit: also confirmed to affect 7.0 beta and 7.0.1, i.e. not just the release candidate)
Conversely, a number of users with DS918+ images have reported no issues with running docker or known problematic images (in my case influxdb causes a 100% reproducible crash).
On my baremetal HP Gen8 running 6.2.4 I get the following console output before a reboot:
This is seen by others on baremetal when using docker. Virtualisation platform users see 100% CPU usage on their xpenology guest and it becomes unresponsive, requiring a restart of the guest. The majority of kernel panics cite containerd-shim as being at fault, but sometimes (rarely) it will list a process being run inside a docker container (notably influxdb in my case).
This is notably similar to an issue logged with RHEL a number of years ago that they note was fixed in a subsequent kernel release:
https://access.redhat.com/solutions/1354963