Segmentation fault in otp 26.2 #8051

Closed
itssundeep opened this issue Jan 26, 2024 · 8 comments · Fixed by #8088
Labels: bug (Issue is reported as a bug), team:VM (Assigned to OTP team VM)

itssundeep commented Jan 26, 2024

Describe the bug
We noticed a coredump in OTP 26.2 that did not occur in OTP 26.0. Here is the backtrace:

(gdb) bt
#0  0x00000000007b2aa6 in erts_proc_sig_fetch__ (proc=0x7f273abf9168, buffers=0x0, need_unget_buffers=0) at beam/erl_proc_sig_queue.c:1229
#1  0x000000000086d7c6 in erts_proc_sig_fetch (proc=0x7f273abf9168) at beam/erl_proc_sig_queue.h:1894
#2  erts_garbage_collect_nobump (p=p@entry=0x7f273abf9168, need=need@entry=0, objv=0x7f25c1b0ade8, nobj=2, fcalls=4000) at beam/erl_gc.c:900
#3  0x000000000061c9c7 in erts_execute_dirty_system_task (c_p=c_p@entry=0x7f273abf9168) at beam/erl_process.c:11055
#4  0x00000000006cf339 in erts_dirty_process_main (esdp=esdp@entry=0x7f25b8720880) at beam/beam_common.c:202
#5  0x0000000000606368 in sched_dirty_cpu_thread_func (vesdp=0x7f25b8720880) at beam/erl_process.c:8720
#6  0x0000000000a237ec in thr_wrapper (vtwd=0x7ffc5e92e3f0) at pthread/ethread.c:116
#7  0x00007f301a89abaf in start_thread (arg=<optimized out>) at pthread_create.c:434
#8  0x00007f301a92d17c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

To Reproduce
Hard to reproduce deterministically; a speculative stress sketch follows below.
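
For context, a purely hypothetical stress sketch, NOT a confirmed reproducer (the module name gc_port_stress and all details are made up for illustration). The backtraces in this report suggest a garbage collection of a large-heap process on a dirty scheduler racing with incoming port signals (tcp_closed, unlink-ack), so a loop along these lines might provoke that combination:

%% Hypothetical stress sketch (NOT a confirmed reproducer): try to make
%% a garbage collection of a large-heap process race with incoming port
%% signals (tcp_closed / unlink-ack), the combination seen in the
%% backtraces in this report.
-module(gc_port_stress).
-export([run/1]).

run(N) ->
    {ok, L} = gen_tcp:listen(0, [binary, {active, false}]),
    {ok, Port} = inet:port(L),
    [one_round(L, Port) || _ <- lists:seq(1, N)],
    ok.

one_round(L, Port) ->
    Owner = spawn(fun() ->
                      {ok, S} = gen_tcp:connect("localhost", Port,
                                                [binary, {active, true}]),
                      _Big = lists:seq(1, 500000),  %% grow the heap so GC is expensive
                      receive {tcp_closed, S} -> ok end
                  end),
    {ok, A} = gen_tcp:accept(L),
    gen_tcp:close(A),  %% makes the runtime deliver tcp_closed to Owner
    %% Request an async GC on Owner while port signals are in flight;
    %% the {garbage_collect, Ref, _} reply to the caller is ignored here.
    erlang:garbage_collect(Owner, [{async, make_ref()}]).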

Expected behavior
No crashes.

Affected versions
I think it also impacts 26.1.

Additional context
Info from the crash:

(gdb) etp-process-info proc
  Pid: <0.24317.3736>
  State: dirty-running-sys | dirty-active-sys | sig-q | active-sys | nmsig-in-q | active | prq-prio-normal | usr-prio-normal | act-prio-normal

  Flags: dirty-minor-gc force-gc
  Current function: erlang:bif_handle_signals_return/2
  I: #Cp<0x7f2563400260>
  Heap size: 196650
  Old-heap size: 1199557
  Mbuf size: 2312
  Msgq len: 2 (inner=2, outer=0)
  Msgq Flags: handling-sig on-heap
  Parent: <0.22824.81>
  Pointer: (Process*)0x7f273abf9168
(gdb) etp-sigqs proc
Msgq Flags: handling-sig on-heap
--- Inner signal queue (message queue) ---
  [{inet_reply,#Port<0.33357938>,ok,#Ref<0.482064594.4069785604.108568>} @token= undefined @from= #Port<0.33357938> % <== SAVE]

  Message signals: 1
  Non-message signals: 0

--- Middle signal queue ---
  [{tcp_closed,#Port<0.33357938>} @token= undefined @from= #Port<0.33357938>]

  Message signals: 1
  Non-message signals: 0

--- Outer queue ---
  [!MONITOR-DOWN[1]]

  Message signals: 0
  Non-message signals: 1


(gdb) p proc->sig_qs.nmsigs.last
$5 = (ErtsMessage **) 0x7fcfb01c3560
(gdb) p *proc->sig_qs.nmsigs.last
$6 = (ErtsMessage *) 0x0

In another thread, the same process is handling an incoming signal:

(gdb) bt
#0  0x00007f301a9100c7 in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00000000007bfd15 in erts_thr_yield () at beam/erl_threads.h:2548
#2  erts_proc_lock_failed (p=p@entry=0x7f273abf9168, pixlck=pixlck@entry=0x0, locks=locks@entry=1, old_lflgs=<optimized out>) at beam/erl_process_lock.c:509
#3  0x000000000078af59 in erts_proc_lock__ (pix_lck=<optimized out>, locks=<optimized out>, p=<optimized out>) at beam/erl_process_lock.h:692
#4  erts_proc_lock (locks=<optimized out>, p=<optimized out>) at beam/erl_process_lock.h:954
#5  erts_schedule_proc2port_signal (prt=<optimized out>, caller=<optimized out>, refp=<optimized out>, sigdp=<optimized out>, task_flags=<optimized out>, pthp=<optimized out>, callback=<optimized out>, c_p=<optimized out>) at beam/io.c:1208
#6  erts_schedule_proc2port_signal (c_p=0x7f273abf9168, prt=0x7f25efbc63c0, caller=6254096339, refp=<optimized out>, sigdp=0x7f25b88d8540, task_flags=0, pthp=0x0, callback=0x78b130 <port_sig_unlink_ack>) at beam/io.c:1153
#7  0x0000000000795fb1 in erts_port_unlink_ack (c_p=c_p@entry=0x7f273abf9168, prt=0x7f25efbc63c0, sulnk=sulnk@entry=0x10e41c0) at beam/io.c:2630
#8  0x00000000007b655f in erts_proc_sig_handle_incoming (c_p=c_p@entry=0x7f273abf9168, statep=statep@entry=0x7f256139ac08, redsp=redsp@entry=0x7f256139ac0c, max_reds=3993, local_only=local_only@entry=0) at beam/erl_proc_sig_queue.c:6169
#9  0x00000000007b9aec in erts_internal_dirty_process_handle_signals_1 (A__p=0x192cb20, BIF__ARGS=<optimized out>, A__I=<optimized out>) at beam/erl_proc_sig_queue.c:8307
#10 0x00007f2563400649 in ?? ()
#11 0x0000000000000000 in ?? ()

I wonder if the fix for #7595 is somehow causing these crashes.

cc: @rickard-green

itssundeep added the bug (Issue is reported as a bug) label Jan 26, 2024
IngelaAndin added the team:VM (Assigned to OTP team VM) label Jan 26, 2024
id added a commit to emqx/emqx that referenced this issue Jan 31, 2024
@rickard-green

Is it possible to get access to the core file? If so, the beam.smp file used is also needed.

@rickard-green

Looking closer at the excellent information you gave when creating the issue, there is no need for the core file. I see what the issue is. I don't have a fix for it yet, though; I will hopefully have a PR with a fix this week.

@itssundeep

We also notice sched_util spikes and delays in processing is_process_alive, which result in message-queue buildup for the process.

Thanks for the response; we will wait for the fix.

id added a commit to id/emqx that referenced this issue Feb 2, 2024
rickard-green linked a pull request Feb 5, 2024 that will close this issue
@rickard-green

#8088 should fix this crash.

> We also notice sched_util spikes and delays in processing is_process_alive, which result in message-queue buildup for the process.

The reason is_process_alive() takes longer is most likely a bug that was fixed in OTP 26.1: it previously did not detect all outstanding signals and could violate the signal-order guarantee of the language. It cannot be made to return as quickly as it used to in certain situations without reintroducing that bug.
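
For what it is worth, the added latency can be observed with standard APIs only. A minimal probe sketch (the function name probe and the flood size are arbitrary; timer:tc/1 and is_process_alive/1 are stock stdlib/BIF calls):

%% Minimal latency probe: is_process_alive/1 is ordered after other
%% outstanding signals from the same sender, so it slows down while the
%% target still has queued-up signals to work through.
probe() ->
    Busy = spawn(fun Loop() -> receive _ -> Loop() end end),
    [Busy ! {msg, I} || I <- lists:seq(1, 100000)],  %% queue up signals
    {Micros, true} = timer:tc(fun() -> is_process_alive(Busy) end),
    exit(Busy, kill),
    Micros.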


rickard-green commented Feb 5, 2024

Note that #8088 has not been thoroughly tested yet. I've only done some basic testing locally on my machine.

@itssundeep

Thanks for the fix. Can you land it on the maint-26 branch? Once it has landed, we can try it out.

@rickard-green

The branch in #8088 is based on top of maint-26 (OTP 26.2.1). Once we have tested it enough, the fix will be released as patches on both OTP 25 and OTP 26.

@rickard-green

Patches with fixes for this bug have now been released in OTP 26.2.2 and OTP 25.3.2.9.
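
As an aside for anyone verifying the upgrade: erlang:system_info(otp_release) only reports the major release (e.g. "26"), so checking the exact patch level means reading the OTP_VERSION file under the release directory, which is a standard, documented location (the function name here is made up):

%% Return the full OTP patch version (e.g. "26.2.2") of the running node;
%% erlang:system_info(otp_release) alone only yields the major release.
otp_patch_version() ->
    Rel = erlang:system_info(otp_release),
    File = filename:join([code:root_dir(), "releases", Rel, "OTP_VERSION"]),
    {ok, Bin} = file:read_file(File),
    string:trim(unicode:characters_to_list(Bin)).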
