VU+ kexec and kernel updates #877
Comments
So this is a failure after a software update? |
So, when is |
I considered this possible issue at the start. Since the kexec multiboot should be distribution-agnostic, changing the kernel post-install script isn't a nice idea from my point of view, so I've considered a mount -o bind real-kexec-img /dev/mmcblk0pxx |
The OpenPLi kirkstone build wrecked a lot of boxes because I agree that under normal circumstances it shouldn't be rebuilt, but bitbake moves in mysterious ways... |
I'm a little bit confused because OpenPLi is using its own build system. So why do you create an issue here? Or am I missing something? |
The offending code is present in the BSP of all images, including OE-A. So the same will happen if you, for example, update OpenATV in a slot and the update includes a kernel. I posted this here to get consensus on a solution; we're all in the same boat. |
The postinst probably needs to be something like
The test for /proc/cmdline is probably not needed (given the earlier test for /proc/stb), but I've seen that after the issue occurred and the box has rebooted into slot 0, /proc/cmdline doesn't exist, or exists but is 0 bytes. |
@dpeddi The problem is that the kexec kernel that was installed in the kernel partition is overwritten by the postinst of the kernel of the This can be fixed by making the postinst kexec aware (see above), but that doesn't fix it for older images. As the kexec kernel has been wiped, I'm not sure if this could be addressed from within the kexec scripts, as they won't run anymore. Since /usr/bin/kernel_auto.bin is still present on the box, it might be possible to revive the broken box simply by doing
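As an untested sketch of the kexec-aware check discussed here: the function below decides whether a kernel postinst may write to flash. The marker string `kexec=1` and the function name `may_flash_kernel` are assumptions made for illustration and are not taken from any BSP. A missing or empty cmdline file is treated as "do not flash", which is a conservative design choice rather than something agreed in this thread.

```shell
#!/bin/sh
# Sketch only: decide whether a kernel postinst may write to the kernel partition.
# KEXEC_MARKER is a hypothetical pattern; replace it with whatever your kexec
# bootargs actually contain.
KEXEC_MARKER="${KEXEC_MARKER:-kexec=1}"

# may_flash_kernel CMDLINE_FILE
# Returns 0 when flashing looks safe, 1 when it should be skipped.
may_flash_kernel() {
    cmdline_file="$1"
    # Missing or empty cmdline: the boot state is unknown, so leave flash alone.
    [ -s "$cmdline_file" ] || return 1
    # Booted through kexec multiboot: flash holds the kexec kernel, do not overwrite it.
    if grep -q -- "$KEXEC_MARKER" "$cmdline_file"; then
        return 1
    fi
    return 0
}
```

A postinst could call `may_flash_kernel /proc/cmdline` right before its existing write to the kernel partition and skip that write when the function returns non-zero.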
(after determining what the MTD KERNEL device should be). |
Or much easier, flash the kexec version of kernel_auto.bin with a USB stick to revive the box. |
That does the same as my manual Is that simply copying kernel_auto.bin from /usr/bin to /vuplus/ on the USB stick? Given the fact we will never be able to address this retroactively (for images already built), we need to have some procedure ready for users having this problem, so they don't start with a standard USB flash again, and wipe out all multiboot slots... |
Apparently nobody here sees this as an issue that needs addressing? |
Dpeddi ( the originator of the kexec kernel) is on holiday, so better to wait for his return. |
Will give a look within a week or two |
Within days we've had several people with this issue. It only occurs if the kernel-image package is in the updates. This happened for all when the VU+ recipes were altered due to code.vuplus.com going down, it also happens sometimes when changes are made to the BSP during development, and it happens in OpenPLi when people have installed a release candidate, and do a software update after the version is released (which upgrades the RC to the release version). |
Ok, so I will ask again. Why is the package being updated when there is no change? I just checked our previous image version and the package name is identical. So why is the package name changing on PLi? I know this is not the answer to the problem, but it is the reason we don't see it. |
The name doesn't change, the PR does after a new build, so opkg sees it as an update. I agree there should not be anything to update, afaik none of the VU+ BSP changes has an influence on the kernel build. But bitbake decided to rebuild the kernel. Which gave the kernel-image package a new PR. Come to think of it, chances are you won't really encounter this problem in OE-A images (under normal circumstances) as you don't use a PR server, so even if the package is built again, it will have the same PR, so opkg will not see it as an update. |
The post-install could fix your latest image, so that could be OK... But if someone installs an old OpenPLi and then updates, they will overwrite the kexec kernel. Probably both should be implemented: the post-install fix and some additional protection with mount -o bind /slot/kernel.img /dev/mmcblk0x in the kexec initrd |
The kernel was rebuilt because the kernel source was also downloaded from code.vuplus.com, so that was a SRC_URI that needed changing in the BSP. Again, OE-A didn't have that problem either, because everything is forked and copied to some OE-A specific location. |
That is beyond me I'm afraid. |
@Huevos to be specific: OE-A images don't have an issue in this specific case (because the dependency on code.vuplus.com wasn't there), but will have the same problem as soon as there is a reason to manually bump the PR of the linux kernel recipe. |
Yes it was, we only changed that SRC_URI a few weeks ago, but not the PR because the code is identical. |
Like I said, we use a PR server, so we don't control the PR, bitbake does. And it bumped the PR when the SRC_URI changed. Images that use a hardcoded PR will also have the issue when they need to bump the PR. Meaning that although it wasn't an issue this time, it may become one in the future, so imho it is worth thinking about it, and not ignore it. |
We will update it when @dpeddi is back from holiday. |
@WanWizard The guest image remounts /dev, so the overridden device becomes invisible. So the solution you propose is the best available. I think we will include it in oe-a with credits to you. |
And something in the initrd of slot 0 that can detect the wrong kernel has booted? Because when the issue happens, half the filesystem is missing (like /sys, completely empty). Also, my postinst suggestion has been written off the top of my head, not tested, so please double-check it. |
Multiboot kernel consists of:
If the guest flashes the kernel, it writes a non-kexec kernel. Without the kexec kernel no initrd is called, so we can't implement what you are asking. On the next reboot it would start a kernel that could be mismatched with the kernel modules and the filesystem of the "recovery" image. However, what you describe is a bit strange. Usually no kernel modules are needed during normal boot to get all the file systems mounted and ethernet connectivity. On which box did this mount issue happen? Which recovery image was used? Which guest image? |
My idea was that after the box is flashed, the user opts for installation on multiboot from Enigma. This installs the kexec kernel, and my suggestion was to also include something in the kexec package, to be installed in /etc/init.d/ in slot 0, that can detect that slot 0 was booted because the kexec kernel in flash was overwritten. I had OpenPLi develop in slot 0, but people with other images in slot 0 have reported the same: if the kexec kernel is overwritten due to a kernel update in a multiboot slot, the box reboots in slot 0, with the kernel that was written to flash in the update, but /sys is empty (which means that for example bootargs can't be read). I agree with you that apart from some differences in kernel defconfig, all kernel images should be the same as VU+ has never updated one, so I can't explain what is going on. I can only report what I've seen myself, and what others reported: after the kexec kernel is overwritten, the box boots slot 0, enigma starts, but crashes when you start using it. |
My idea was that after the box is flashed, the user opts for installation on multiboot from Enigma. This installs the kexec kernel, and my suggestion was to also include something in the kexec package, to be installed in /etc/init.d/ in slot 0, that can detect that slot 0 was booted because the kexec kernel in flash was overwritten. @Huevos, what do you think? |
True. I was thinking that, as the image in slot 0 is unusable anyway, when the issue is detected, simply dd the kexec kernel into flash again (which should still be in /usr/lib), and reboot the box, which should fix it and boot the original slot again? |
@WanWizard - so do you know where the box crashes? If it's in E2 it's probably manageable as all the components are available to fix it. |
To get debug, I guess I can add code in slot 0 to write the original kernel to disk and then reboot to say slot 1 and see what happens. @WanWizard - would that test setup match the issue as you see it? |
Several locations, but the only one I have available from a user report is
Which happens on an OpenPLi image when you go into the multiboot selection screen. Which is triggered because /sys is completely empty. |
I don't think this complexity is needed. When the issue occurs, slot 0 is always booted (as writing the guest kernel to flash effectively wipes out multiboot), and the kexec kernel file is still available:
so it could be dd'd back into flash, reboot, and the box will start its original slot again. (See the user instructions I wrote: https://wiki.openpli.org/Vu_Multiboot#Multiboot_images_missing_after_an_update.3F ) |
The enigma2 code is really generic. If it finds a STARTUP file in the scanned location, it switches to multiboot mode. So if /sys/firmware/devicetree/base/chosen/bootargs is not available, enigma2 should switch back to single boot mode and alert the user that multiboot is not available and that it may be necessary to reinstall it. I don't know if it is possible to create a popup, but we could certainly switch it to non-multiboot mode and let the user fix it by reinstalling multiboot. |
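The manual revival step described in this thread could be wrapped roughly as follows. This is a sketch under assumptions: the function name `restore_kexec_kernel` is hypothetical, the source image path (the thread mentions /usr/bin/kernel_auto.bin) and the kernel device are passed in by the caller, and the correct device must be determined for the specific box before use. It does not reboot; that is left to the caller.

```shell
#!/bin/sh
# Sketch only: write the saved kexec kernel image back to the kernel partition.

# restore_kexec_kernel SOURCE_IMAGE KERNEL_DEVICE
# Returns 0 on success, 1 if an input is missing or the write fails.
restore_kexec_kernel() {
    src="$1"
    dev="$2"
    # The saved kexec kernel must exist and be non-empty.
    [ -s "$src" ] || { echo "kexec kernel image not found: $src" >&2; return 1; }
    # Only an existence check here; on a real box this should be the kernel block device.
    [ -e "$dev" ] || { echo "kernel device not found: $dev" >&2; return 1; }
    # conv=notrunc keeps the target size unchanged when the target is a regular file.
    dd if="$src" of="$dev" conv=notrunc 2>/dev/null || return 1
    # Flush buffers before the caller reboots.
    sync
}
```

After a successful call, rebooting should bring the kexec kernel back up, as described in the wiki instructions linked above.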
The problem isn't Enigma, the problem is the wrong kernel is written to flash, which can be detected and fixed, so I don't really see why we need changes to Enigma. Also, you only get this when you go into that specific screen, but if you don't, other issues will appear, as you're running only half an operating system. This can lead to issues for the user (loss of functionality or even data), and for us (increased support requests). So can we keep an eye on the ball please, and fix this issue instead of working around it? |
Enigma should handle a missing /sys/firmware/devicetree/base/chosen/bootargs. If we add a try/catch or a check for the presence of the file, we can alert the user that the multiboot may have been broken and that it should be fixed by reinstalling it. I think that's the proper way to proceed, since it could work with other guest images too. For sure we need the post_install workaround |
It is not only that, all of /sys is missing. /boot isn't mounted properly, and there are more issues. The last thing we need is to allow Enigma to start, which gives the impression that all is fine. Users won't even realize that there is an issue until something serious happens, or they want to boot another slot and realize the slots are gone. It should be addressed as soon as slot 0 boots, not worked around in Enigma by an end-user that doesn't have our skillset. |
Hello WanWizard, The solution that we are going to implement in oe-a takes some ideas from what you suggested, plus some improvements; however, it still needs testing.
The rest of the solution will need an updated recovery image.
The recovery script could be put in the default startup, where it would try to fix the recovery image. It checks for the presence of /STARTUP and /STARTUP.cpio.gz. If they are present, it assumes the running image is kexec multiboot enabled, checks whether /sys/firmware/devicetree/base/chosen/bootargs is missing, checks whether the last selected startup is in flash or not, and locates the path to the guest kernel. It then dumps the kernel in flash to the located path, flashes the kexec kernel into flash again, and reboots.
If that isn't possible, or a power user doesn't want to reflash the recovery, they could run something like the following to be prepared for the situation:
I don't much like auto-recovery in the startup by default, because if something goes wrong the user doesn't have the possibility to back up their files. Maybe there are still some bugs to fix (I still need to complete all the tests on a spare box), but feel free to check it and report whether it seems good to you. Thank you for raising these issues with us |
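The detection part of such a recovery script might look like the sketch below. It only covers the first two checks (the STARTUP files and the missing bootargs node); locating the guest kernel and reflashing are deliberately left out. The function name `kexec_kernel_lost` and its optional arguments are hypothetical and exist only so the check can be exercised against test directories.

```shell
#!/bin/sh
# Sketch only: detect that slot 0 booted because the kexec kernel in flash was lost.

# kexec_kernel_lost [ROOT_DIR] [BOOTARGS_FILE]
# Returns 0 when multiboot appears installed but the bootargs node is missing,
# 1 otherwise.
kexec_kernel_lost() {
    root="${1:-/}"
    bootargs="${2:-/sys/firmware/devicetree/base/chosen/bootargs}"
    # Without both STARTUP files this is not a kexec multiboot recovery image.
    [ -e "$root/STARTUP" ] && [ -e "$root/STARTUP.cpio.gz" ] || return 1
    # bootargs is present: the kexec kernel is still in place.
    [ -e "$bootargs" ] && return 1
    return 0
}
```

An init script in slot 0 could call `kexec_kernel_lost` early in boot and only proceed to any repair step when it returns 0.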
Thanks 👍 . I'll have a look and try to keep OpenPLi in sync. |
There have been long-standing complaints that under certain conditions, the VU+ kexec multiboot isn't stable, which may lead to a broken kexec system (so the slot 0 image boots) or a non-booting box.
After suffering from this problem last night, I've decided to look into it.
The root cause seems to be that the postinst of the kernel-image package does a hardcoded dd into the kernel partition, which overwrites the kernel of the slot 0 image, not the one of the running image.
This can be addressed in the BSP, by using something like findkerneldevice.sh, like other brands do, but that only fixes it for newly built images, not for all those images already out there. You could show a warning in enigma when kernel-image-* is amongst the packages being updated, but again that only addresses it for newly built images.
Since this is an issue for all image makers, I'm interested in your thoughts on things, so we can come up with a common solution (if any).
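The warning idea could be prototyped outside Enigma with a small filter over the list of pending upgrades. This is a minimal sketch, assuming the input has one package per line with the package name first, as printed by `opkg list-upgradable`; the function name `kernel_update_pending` is hypothetical.

```shell
#!/bin/sh
# Sketch only: report whether a kernel-image-* package is among pending upgrades.

# kernel_update_pending
# Reads package lines on stdin; returns 0 if any line starts with "kernel-image-".
kernel_update_pending() {
    grep -q '^kernel-image-'
}
```

For example, `opkg list-upgradable | kernel_update_pending && echo "kernel update pending"` would print the message only when a kernel package is about to be replaced.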