HS58: Linux occasionally does not boot with ATLD implementation of atomics #167
Comments
I started the OpenOCD session and then restarted the board. It stopped booting at the “Advanced Linux Sound Architecture Driver Initialized.” message. I opened the OpenOCD console and typed a few commands:
The last command ( |
I was able to gather the CPU trace (SmaRT trace). Interestingly, reading the AUX regs needed to gather the trace didn’t cause Linux to continue, unlike the “reg pc” command. The last entry in the SmaRT trace is:
and corresponds to the following instructions from the vmlinux listing assembler code (function
FLAGS = 0x80000400 means that the bit An Could this be proof that this is some kind of deadlock? |
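To make the FLAGS value easier to read, here is a small sketch that lists which bit positions are set in 0x80000400. It only reports positions; it does not assume what those bits are called or what they mean in the SmaRT record. The helper name `set_bits` is made up for illustration.

```python
def set_bits(value: int, width: int = 32) -> list[int]:
    """Return the positions of all bits that are 1, highest position first."""
    return [bit for bit in range(width - 1, -1, -1) if (value >> bit) & 1]


flags = 0x80000400  # value quoted in the comment above
print(set_bits(flags))  # bit positions only, no bit names assumed
```

The meaning of each position still has to be looked up in the SmaRT documentation for the core in question.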
I'm attaching example Linux defconfig that is used in our tests. |
I'm attaching the SmaRT trace for all cores, as read by the MetaWare Debugger |
Thanks, Jakub. It looks like the other 2 cores stopped execution somewhere in the function rcu_dynticks_inc:
|
I added defconfig options for additional checking of the locking mechanisms:
The boot log: The boot stalls after The tests seem to pass: This part shows some errors:
|
Another symptom of the issue is that with ATLD the rcutorture module shows a "process hang" warning. With the previous LLSC option, rcutorture was working fine.
|
I tried reproducing on nSim (multicore), but no luck so far. |
I'm not sure how mdb was able to read the SmaRT trace data from all the cores, because from my observation, if I read (using OpenOCD) the SMART_CONTROL or SMART_DATA registers from core 1 or 2, Linux immediately resumes booting. Only on core 0 does this have no effect, so using OpenOCD I can grab the SmaRT trace only from core 0. Reading the status32 register on any core does not resume the Linux boot.
After the last command on core 2 the CPUs were resumed. |
Reading AUX regs 0x4, 0x6 (IDENTITY, PC) even on core 0 resumes the CPUs. My suspicion is that for some commands OpenOCD triggers another read or write, which has the effect described above. One hypothesis is that this is a lockup in the CPU pipeline and that the JTAG read/write fills or flushes the pipeline and unlocks the CPUs. This hypothesis is still valid. @abrodkin @xxkent are you able to reproduce it on your FPGA system? |
I ran Linux on HAPS and in nSIM with the options you provided: In both cases there is a "Stack Trace" message during boot, which means that this is a SW issue:
Linux_boot_HAPS.txt It is quite strange that the deadlock test runs only on core_0 while core_1 and core_2 are in the boot loop. Thus, core_1 and core_2 don't participate in this test. |
I will conduct this test with your hs58_defconfig.txt as the next step. |
The issue (Linux occasionally does not boot with ATLD implementation of atomics) is not reproduced on HAPS. I see some other issue with boot, but it is most likely a SW issue because it exists in nSIM too. |
Thanks for checking this. How many resets have you performed? In our case the repro rate is not 100%, so to be sure you will need to perform tens of resets. |
I took exactly the config you provided (hs58_defconfig.txt) and changed only CONFIG_LINUX_LINK_BASE=0x0, CONFIG_LINUX_RAM_BASE=0x0, CONFIG_LINUX_MAP_SIZE=0x60000000 and CONFIG_ARC_BUILTIN_DTB_NAME=haps_hs5x_idu. I made about 6 attempts on HAPS. OK, I'll keep running this throughout the day to get more attempts, and if anything changes, I'll let you know. |
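As a rough sanity check on "tens of resets", the sketch below estimates how likely it is to see no hang at all in a given number of attempts. It assumes resets are independent and that the rate from the issue description (626 hangs in 2177 resets) also applies to the other setup. Neither assumption is confirmed in this thread, and the hang is reported to be build specific. The function names are made up for illustration.

```python
import math


def miss_probability(hang_rate: float, attempts: int) -> float:
    """Probability of observing zero hangs in `attempts` independent resets."""
    return (1.0 - hang_rate) ** attempts


def attempts_needed(hang_rate: float, max_miss: float) -> int:
    """Smallest number of resets for which the miss probability is <= max_miss."""
    return math.ceil(math.log(max_miss) / math.log(1.0 - hang_rate))


rate = 626 / 2177  # figures taken from the issue description
print(f"chance of no hang in 6 attempts: {miss_probability(rate, 6):.3f}")
print(f"resets needed for <1% miss chance: {attempts_needed(rate, 0.01)}")
```

Under those assumptions, 6 clean attempts still leave roughly a 13% chance that the hang was simply missed, so a longer run is the safer basis for a "not reproducible" conclusion.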
A suggestion from Alexey was to check whether, with ATLD selected in menuconfig, there are still ‘llock’ and ‘scond’ instructions in the listing. The result is no, there are none: grep'ing for these finds nothing
|
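For anyone who wants to repeat the llock/scond check on their own build, here is a minimal sketch that counts those mnemonics in a disassembly listing. It matches whole words only, so a symbol name that merely contains "llock" is not counted. The sample listing text is invented for illustration and is not taken from the real vmlinux.

```python
import re


def find_mnemonics(listing: str, mnemonics=("llock", "scond")) -> dict[str, int]:
    """Count whole-word occurrences of each mnemonic in a disassembly listing."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mnemonics)) + r")\b")
    counts = {name: 0 for name in mnemonics}
    for line in listing.splitlines():
        for match in pattern.finditer(line):
            counts[match.group(1)] += 1
    return counts


# Invented sample text, only to show the shape of the input.
sample = "0:\tex\tr2,[r0]\n4:\tnop\n"
print(find_mnemonics(sample))
```

All counts at zero corresponds to the "grep finds nothing" result reported above.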
This issue is not reproducible with single-core SMP Linux. |
Could you provide call stacks for each core from MDB? |
Please also try this patch: 28e6344 |
Hi @xxkent, I was testing issue #168 with the 28e6344 patch applied, but sshd was freezing with the following error:
However, this problem can be reproduced on ssh running on localhost. Steps:
You can try reproducing it on your side, maybe it will help in solving the bugs. |
The good news is that with patch 28e6344 I can no longer see the boot-time freeze; I tried over 2k times. |
Yes, here is a new capture of SmaRT + call stacks for each core: This time I reproduced this by loading and starting vmlinux from MDB; the result was the same. |
Hi @jzbydniewski! Thank you for this output, it is very valuable. For a clearer picture, can you gather some additional information? Please provide a SmaRT trace and register dump of all cores for a few additional runs to ensure we are in the same state every time we hang. P.S. Unfortunately commit 28e6344 can't be used as the final solution for the described issue: additional changes may be required due to #167 (comment), and moreover it may hide a real problem with the CPU. |
Logs for the next 5 reproductions: |
I did two experiments with hw counters enabled, as below:
hwc count cond=iall cond=crun cond=illock cond=iscond cond=imematomic cond=dbgflush cond=bstall cond=always
1st experiment: issue reproduced, I stopped the cores in mdb after <1s:
2nd experiment: issue reproduced, I stopped the cores in mdb after ~10s:
Regards, |
I can see that resuming the cores with MDB works (and Linux continues executing) only if I do not pass param
Looking at data caches and memory (still with the -off=flush_dcache option set), I can see that address 0x81d6609c (constant across reproductions with the same vmlinux), which seems to be used with the atomic exchange "ex" instruction on core 1 and core 2, is 0x0 in memory but 0x01 in the core 2 data cache. Could this be causing the issue? cache contents: SmaRT: regs: I did a simple experiment in that state: I changed the value at 0x81d6609c to 1 and resumed the cores (a kind of manual flush). As a result Linux booted to the console. |
Please conduct the following experiment:
To patch automatically you can add these commands to your commands.txt file: |
As the next step, if that doesn't help, you can also do the same for multi_cpu_stop() and _raw_spin_unlock_irqrestore(). |
@xxkent I tried this for rcu_dynticks_inc() first,
As a result I am getting a crash
|
Thanks @jzbydniewski. It is quite unexpected to me that access to a global variable with .di gets a "BUS FAULT". Let me discuss this with colleagues. |
This behavior is predictable since the bus doesn't support atomic operations, which means that we can't work around this issue this way using .di. |
You swapped bytes 0,1 with 2,3. The disassembly shows bytes 0 and 1 in memory as the high bytes and bytes 2 and 3 as the low bytes. |
It looks like with CONFIG_ARC_HAS_LL64=n this issue is not showing up, but I guess that might just be due to different timing. |
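The byte-order remark can be illustrated with a small sketch that swaps the two 16-bit halves of a 32-bit value, both as an integer and as raw bytes. It only demonstrates the reordering of bytes 0,1 and 2,3. How a particular debugger command expects the value is not covered here, and the function names are made up for illustration.

```python
def swap_halfwords(word: int) -> int:
    """Swap the upper and lower 16-bit halves of a 32-bit value."""
    word &= 0xFFFFFFFF
    return ((word << 16) | (word >> 16)) & 0xFFFFFFFF


def swap_halfwords_bytes(raw: bytes) -> bytes:
    """Move bytes 2,3 in front of bytes 0,1 in a 4-byte sequence."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    return raw[2:4] + raw[0:2]


print(hex(swap_halfwords(0x11223344)))
print(swap_halfwords_bytes(bytes([0x11, 0x22, 0x33, 0x44])).hex())
```

Applying either function twice gives back the original value, which is a quick way to check a hand-edited patch word.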
After switching to ATLD atomics we see occasional hangs while Linux boots. The last message that is seen on the console is:
Advanced Linux Sound Architecture Driver Initialized.
The behavior was first described here: #162
If I attach OpenOCD JTAG to the LPU, Linux continues booting.
In 2177 resets the issue happened 626 times, so the repro rate is quite high, about 29%.
This seems to be specific to the buildroot (Linux) build, as it is quite easily reproducible with one buildroot build but didn’t happen with another one (with changes not related to this issue).