Users reporting 4GB Pi 4 still reboots #47
Comments
Just an observation connected with the reliability of the simultaneous task count: for several days my 4GB Raspberry Pi had been running stably with 4 cores occupied, but when I started fiddling with overclocking yesterday, it started rebooting by itself. That might just mean that on the previous days I got tasks that were less demanding of the CPU, and one device is surely too little to judge from. But it made me wonder whether underclocking, instead of limiting the core count, would improve stability enough to make the overall task count higher than with the current solution. Another option would be setting a kernel cgroup limit to use a certain percentage of the CPU (available as a compose config field).
@ptrm that's definitely something we should investigate. If you think about it from a fleet level we have the ability to deploy to thousands of devices and gather metrics on how they perform in order to figure out what the settings are that result in most work units being completed. |
@ptrm on a fleet level we're still seeing a lot of Pi 4 reboots, which do seem to be on 4GB boards, even when limited to 3 tasks.
There are certainly many more indicators than the ones I write about below, but here is what I managed to do to get my two rpi4s stable under the current load settings (1 core for the 2GB rpi, 3 cores for the 4GB one). One thing that turned out to be rebooting my devices was undervoltage and underpowering. It's a common problem, especially for the rpi4. Basically Raspberry Pis, and the rpi4 most of all, require the voltage to be stable and as close as possible to 5 V, while most general-use and even high-current chargers provide ~4.9 V or less under zero load, and then even less as the current rises (which is OK for charging 3.7 V Li-ion/Li-Po batteries). I came up with this snippet as a helpful tool to paste into the balenaOS shell (rpi3 balenaOS seems to not have the vc tools installed); the tr step uppercases the hex digits, which bc requires:

while true; do \
sleep 1; \
clear; \
date --iso-8601=s; \
echo -ne 'vcgencmd get_throttled:\t\t'; \
echo "ibase=16;obase=2;$(vcgencmd get_throttled | sed -E 's/^[^=]+=0x//' | tr 'a-f' 'A-F')" | bc; \
echo -ne 'vcgencmd measure_clock arm:\t'; vcgencmd measure_clock arm; \
done

If get_throttled outputs anything other than zero, it means some undervoltage or other throttling event occurred, and that usually correlated with reboots of my device. See the docs under get_throttled. There are separate flags for frequency capping, undervoltage, and excess temperature, both for the current moment and for the past. Here is my properly powered rpi4 for example (overclocked to 1.7 GHz), and if it had ever been underpowered since the last reboot, the get_throttled value would look something like
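The separate flags mentioned above can also be decoded by name rather than read off as raw binary. The following is a minimal sketch, not something from the thread: `decode_throttled` is a hypothetical helper name, and the bit meanings are taken from the Raspberry Pi documentation for `vcgencmd get_throttled` (bits 0 to 3 describe the current state, bits 16 to 19 record what has happened since boot).

```shell
# Decode a get_throttled value such as "throttled=0x50005" into flag names.
# decode_throttled is a hypothetical helper, not part of vcgencmd.
decode_throttled() {
    # Drop any "throttled=" prefix; shell arithmetic parses 0x... as hex.
    n=$(( ${1#*=} ))
    if [ "$n" -eq 0 ]; then
        echo "no flags set"
        return 0
    fi
    # Each line of the table below is "<bit number> <meaning>".
    while read -r bit label; do
        if [ $(( (n >> bit) & 1 )) -eq 1 ]; then
            echo "$label"
        fi
    done <<'EOF'
0 under-voltage detected
1 ARM frequency capped
2 currently throttled
3 soft temperature limit active
16 under-voltage has occurred
17 ARM frequency capping has occurred
18 throttling has occurred
19 soft temperature limit has occurred
EOF
}

# Example with a literal value; prints four lines, one per set flag.
decode_throttled "throttled=0x50005"
```

On a device with the vc tools installed, the same helper could be fed live output with `decode_throttled "$(vcgencmd get_throttled)"`.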
And fleetwise, it might be good to write something on the project's webpage about using a good (or the official) power supply. Plus, I now remember that after my first deployments to balena I got the device-level variable
How do I distinguish reboots from the "last online" status, btw? Does the HTTP API provide more options? I have a machine that's said to have been online for 2 hours, but its uptime in balenaOS is 23:19, which indicates no reboots at all :o
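The comparison described here can be written down as a small check: if the device has been up for longer than it has been reported online, the reconnect cannot have been caused by a reboot. This is only a sketch under assumptions: `classify_reconnect` is a hypothetical helper, and the "online for" duration is assumed to be read off the dashboard by hand and converted to seconds.

```shell
# classify_reconnect UPTIME_S ONLINE_S
# UPTIME_S: seconds since boot (on a device: cut -d. -f1 /proc/uptime)
# ONLINE_S: seconds the dashboard says the device has been online (assumed input)
classify_reconnect() {
    if [ "$1" -gt "$2" ]; then
        echo "no-reboot"        # device was already up before it came back online
    else
        echo "possible-reboot"  # uptime fits inside the online window
    fi
}

# Figures from the comment above: uptime 23:19 (83940 s), online for 2 hours (7200 s).
classify_reconnect 83940 7200
```

A result of `no-reboot` points at a connectivity drop rather than a power problem, which is the situation described above.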
@ptrm that's a good point you make and something I hadn't considered. Initially when we were looking at this issue, reboots were definitely occurring and resetting the device uptime as expected. However, now I'm looking at a sample of devices from the fleet that have been online for a few minutes, and their uptimes are all measured in days. Perhaps the limitation to 3 tasks had a more substantial effect than I first thought. We did see a marked jump in output after the fleet was updated on Friday morning: https://www.boincstats.com/stats/14/team/detail/18832/charts The balenaCloud dashboard does have a per-device diagnostics facility which checks for undercurrent/underpower events (see here), but there's no way to run this on an entire fleet and correlate results at the moment.
Added issue regarding missing |
Yeah, I noticed it can be checked here as well: Glad it's open source, though, the scripts look very useful. EDIT: it would be good to be able to run them separately, and also, maybe there's a way to tag a machine from the supervisor level, so that the fleet view shows a flag for having ever been underpowered? (Seeing that tags can have values, I assume even underpower counts might come into play.) And yeah, the chart looks impressive
Users are still reporting that 4GB Pi 4s reset when working on 4 tasks. Reports say 3 tasks are OK. 2GB devices are now running stably with 1 task.
Reduce the number of Pi 4 allocated tasks to 3 by setting CPU usage percent to 75.
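The 75 figure is simply 3 tasks out of 4 cores expressed as a percentage. A minimal sketch of that arithmetic follows; `ncpus_pct` is a hypothetical helper name. How the project actually applies the value is not shown in this issue; BOINC's `max_ncpus_pct` preference is one place such a percentage could go, but that is an assumption here.

```shell
# ncpus_pct TASKS CORES: percentage of CORES needed to run TASKS tasks
# (integer arithmetic, rounds down).
ncpus_pct() {
    echo $(( $1 * 100 / $2 ))
}

ncpus_pct 3 4   # prints 75 (4GB Pi 4 limited to 3 tasks)
ncpus_pct 1 4   # prints 25 (2GB Pi 4 running 1 task)
```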