Segmentation fault in otp 26.2 #8051

Closed
itssundeep opened this issue Jan 26, 2024 · 8 comments · Fixed by #8088
Labels: bug (Issue is reported as a bug), team:VM (Assigned to OTP team VM)

itssundeep commented Jan 26, 2024

Describe the bug
We noticed a coredump in OTP 26.2 that did not occur in OTP 26.0. Here is the backtrace:

(gdb) bt
#0  0x00000000007b2aa6 in erts_proc_sig_fetch__ (proc=0x7f273abf9168, buffers=0x0, need_unget_buffers=0) at beam/erl_proc_sig_queue.c:1229
#1  0x000000000086d7c6 in erts_proc_sig_fetch (proc=0x7f273abf9168) at beam/erl_proc_sig_queue.h:1894
#2  erts_garbage_collect_nobump (p=p@entry=0x7f273abf9168, need=need@entry=0, objv=0x7f25c1b0ade8, nobj=2, fcalls=4000) at beam/erl_gc.c:900
#3  0x000000000061c9c7 in erts_execute_dirty_system_task (c_p=c_p@entry=0x7f273abf9168) at beam/erl_process.c:11055
#4  0x00000000006cf339 in erts_dirty_process_main (esdp=esdp@entry=0x7f25b8720880) at beam/beam_common.c:202
#5  0x0000000000606368 in sched_dirty_cpu_thread_func (vesdp=0x7f25b8720880) at beam/erl_process.c:8720
#6  0x0000000000a237ec in thr_wrapper (vtwd=0x7ffc5e92e3f0) at pthread/ethread.c:116
#7  0x00007f301a89abaf in start_thread (arg=<optimized out>) at pthread_create.c:434
#8  0x00007f301a92d17c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

To Reproduce
Hard to reproduce deterministically; a speculative stress sketch follows below.
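
For context, a purely hypothetical stress sketch, NOT a confirmed reproducer (the module name gc_port_stress and all details are made up for illustration). The backtraces in this report suggest a garbage collection of a large-heap process on a dirty scheduler racing with incoming port signals (tcp_closed, unlink-ack), so a loop along these lines might provoke that combination:

%% Hypothetical stress sketch (NOT a confirmed reproducer): try to make
%% a garbage collection of a large-heap process race with incoming port
%% signals (tcp_closed / unlink-ack), the combination seen in the
%% backtraces in this report.
-module(gc_port_stress).
-export([run/1]).

run(N) ->
    {ok, L} = gen_tcp:listen(0, [binary, {active, false}]),
    {ok, Port} = inet:port(L),
    [one_round(L, Port) || _ <- lists:seq(1, N)],
    ok.

one_round(L, Port) ->
    Owner = spawn(fun() ->
                      {ok, S} = gen_tcp:connect("localhost", Port,
                                                [binary, {active, true}]),
                      _Big = lists:seq(1, 500000),  %% grow the heap so GC is expensive
                      receive {tcp_closed, S} -> ok end
                  end),
    {ok, A} = gen_tcp:accept(L),
    gen_tcp:close(A),  %% makes the runtime deliver tcp_closed to Owner
    %% Request an async GC on Owner while port signals are in flight;
    %% the {garbage_collect, Ref, _} reply to the caller is ignored here.
    erlang:garbage_collect(Owner, [{async, make_ref()}]).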

Expected behavior
No crashes.

Affected versions
I think it also impacts 26.1.

Additional context
Info from the crash:

(gdb) etp-process-info proc
  Pid: <0.24317.3736>
  State: dirty-running-sys | dirty-active-sys | sig-q | active-sys | nmsig-in-q | active | prq-prio-normal | usr-prio-normal | act-prio-normal

  Flags: dirty-minor-gc force-gc
  Current function: erlang:bif_handle_signals_return/2
  I: #Cp<0x7f2563400260>
  Heap size: 196650
  Old-heap size: 1199557
  Mbuf size: 2312
  Msgq len: 2 (inner=2, outer=0)
  Msgq Flags: handling-sig on-heap
  Parent: <0.22824.81>
  Pointer: (Process*)0x7f273abf9168
(gdb) etp-sigqs proc
Msgq Flags: handling-sig on-heap
--- Inner signal queue (message queue) ---
  [{inet_reply,#Port<0.33357938>,ok,#Ref<0.482064594.4069785604.108568>} @token= undefined @from= #Port<0.33357938> % <== SAVE]

  Message signals: 1
  Non-message signals: 0

--- Middle signal queue ---
  [{tcp_closed,#Port<0.33357938>} @token= undefined @from= #Port<0.33357938>]

  Message signals: 1
  Non-message signals: 0

--- Outer queue ---
  [!MONITOR-DOWN[1]]

  Message signals: 0
  Non-message signals: 1


(gdb) p proc->sig_qs.nmsigs.last
$5 = (ErtsMessage **) 0x7fcfb01c3560
(gdb) p *proc->sig_qs.nmsigs.last
$6 = (ErtsMessage *) 0x0

In another thread, the same process is handling an incoming signal:

(gdb) bt
#0  0x00007f301a9100c7 in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00000000007bfd15 in erts_thr_yield () at beam/erl_threads.h:2548
#2  erts_proc_lock_failed (p=p@entry=0x7f273abf9168, pixlck=pixlck@entry=0x0, locks=locks@entry=1, old_lflgs=<optimized out>) at beam/erl_process_lock.c:509
#3  0x000000000078af59 in erts_proc_lock__ (pix_lck=<optimized out>, locks=<optimized out>, p=<optimized out>) at beam/erl_process_lock.h:692
#4  erts_proc_lock (locks=<optimized out>, p=<optimized out>) at beam/erl_process_lock.h:954
#5  erts_schedule_proc2port_signal (prt=<optimized out>, caller=<optimized out>, refp=<optimized out>, sigdp=<optimized out>, task_flags=<optimized out>, pthp=<optimized out>, callback=<optimized out>, c_p=<optimized out>) at beam/io.c:1208
#6  erts_schedule_proc2port_signal (c_p=0x7f273abf9168, prt=0x7f25efbc63c0, caller=6254096339, refp=<optimized out>, sigdp=0x7f25b88d8540, task_flags=0, pthp=0x0, callback=0x78b130 <port_sig_unlink_ack>) at beam/io.c:1153
#7  0x0000000000795fb1 in erts_port_unlink_ack (c_p=c_p@entry=0x7f273abf9168, prt=0x7f25efbc63c0, sulnk=sulnk@entry=0x10e41c0) at beam/io.c:2630
#8  0x00000000007b655f in erts_proc_sig_handle_incoming (c_p=c_p@entry=0x7f273abf9168, statep=statep@entry=0x7f256139ac08, redsp=redsp@entry=0x7f256139ac0c, max_reds=3993, local_only=local_only@entry=0) at beam/erl_proc_sig_queue.c:6169
#9  0x00000000007b9aec in erts_internal_dirty_process_handle_signals_1 (A__p=0x192cb20, BIF__ARGS=<optimized out>, A__I=<optimized out>) at beam/erl_proc_sig_queue.c:8307
#10 0x00007f2563400649 in ?? ()
#11 0x0000000000000000 in ?? ()

I wonder if the fix for #7595 is somehow causing these crashes.

cc: @rickard-green

itssundeep added the bug (Issue is reported as a bug) label Jan 26, 2024
IngelaAndin added the team:VM (Assigned to OTP team VM) label Jan 26, 2024
id added a commit to emqx/emqx that referenced this issue Jan 31, 2024
@rickard-green

Is it possible to get access to the core file? If so, the beam.smp file used is also needed.

@rickard-green

Looking closer at the excellent information you gave when creating the issue, there is no need for the core file. I see what the issue is. I don't have a fix for it yet, though; I will hopefully have a PR with a fix this week.

@itssundeep

We also notice sched_util spikes and delays in processing is_process_alive, which result in message-queue buildup for the process.

Thanks for the response; we will wait for the fix.

id added a commit to id/emqx that referenced this issue Feb 2, 2024
rickard-green linked a pull request Feb 5, 2024 that will close this issue
@rickard-green

#8088 should fix this crash.

> We also notice sched_util spikes and delays in processing is_process_alive, which result in message-queue buildup for the process.

The reason is_process_alive() takes longer is most likely a bug that was fixed in OTP 26.1: it previously did not detect all outstanding signals and could violate the signal-order guarantee of the language. It cannot be made to return as quickly as it used to in certain situations without reintroducing that bug.
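
For what it is worth, the added latency can be observed with standard APIs only. A minimal probe sketch (the function name probe and the flood size are arbitrary; timer:tc/1 and is_process_alive/1 are stock stdlib/BIF calls):

%% Minimal latency probe: is_process_alive/1 is ordered after other
%% outstanding signals from the same sender, so it slows down while the
%% target still has queued-up signals to work through.
probe() ->
    Busy = spawn(fun Loop() -> receive _ -> Loop() end end),
    [Busy ! {msg, I} || I <- lists:seq(1, 100000)],  %% queue up signals
    {Micros, true} = timer:tc(fun() -> is_process_alive(Busy) end),
    exit(Busy, kill),
    Micros.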


rickard-green commented Feb 5, 2024

Note that #8088 has not been thoroughly tested yet. I've only done some basic testing locally on my machine.

@itssundeep

Thanks for the fix. Can you land it on the maint-26 branch? Once it has landed, we can try it out.

@rickard-green

The branch in #8088 is based on top of maint-26 (OTP 26.2.1). Once we have tested it enough, the fix will be released as patches on both OTP 25 and OTP 26.

@rickard-green

Patches with fixes for this bug have now been released in OTP 26.2.2 and OTP 25.3.2.9.
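
As an aside for anyone verifying the upgrade: erlang:system_info(otp_release) only reports the major release (e.g. "26"), so checking the exact patch level means reading the OTP_VERSION file under the release directory, which is a standard, documented location (the function name here is made up):

%% Return the full OTP patch version (e.g. "26.2.2") of the running node;
%% erlang:system_info(otp_release) alone only yields the major release.
otp_patch_version() ->
    Rel = erlang:system_info(otp_release),
    File = filename:join([code:root_dir(), "releases", Rel, "OTP_VERSION"]),
    {ok, Bin} = file:read_file(File),
    string:trim(unicode:characters_to_list(Bin)).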
