
Zombie/defunct processes caused by xpmem? #45

Open

angainor opened this issue Mar 22, 2021 · 3 comments

@angainor

I'm using xpmem in our home-brew application (OpenMPI + our own xpmem-based in-node communication), on an AMD EPYC cluster running RHEL 7.7 (Maipo), kernel 3.10.0-1062.9.1.el7.x86_64. Sometimes, after the application finishes, several compute nodes are left with many zombie/defunct processes that never die. Looking at the stack of some of those processes, I see this:

[<ffffffffb6acbf5e>] __synchronize_srcu+0xfe/0x150
[<ffffffffb6acbfcd>] synchronize_srcu+0x1d/0x20
[<ffffffffb6c1c10d>] mmu_notifier_unregister+0xad/0xe0
[<ffffffffc0b5e614>] xpmem_mmu_notifier_unlink+0x54/0x97 [xpmem]
[<ffffffffc0b5a13d>] xpmem_flush+0x13d/0x1c0 [xpmem]
[<ffffffffb6c47ce7>] filp_close+0x37/0x90
[<ffffffffb6c6b0b8>] put_files_struct+0x88/0xe0
[<ffffffffb6c6b1b9>] exit_files+0x49/0x50
[<ffffffffb6aa2022>] do_exit+0x2b2/0xa50
[<ffffffffb6aa283f>] do_group_exit+0x3f/0xa0
[<ffffffffb6aa28b4>] SyS_exit_group+0x14/0x20
[<ffffffffb718dede>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

So they seem to be hanging on some XPMEM-related process cleanup. This is strange for a few reasons: I checked, and in the code I match each xpmem_attach with an xpmem_detach. It also seems strange that the kernel would be unable to terminate a process just because xpmem cannot complete its cleanup.
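
To be concrete about what I mean by matching attach with detach: the consumer side follows the usual libxpmem pattern, roughly like the sketch below (illustrative only, not the actual application code; the segid exchange between processes and the error handling are simplified):

```c
/* Illustrative consumer-side attach/detach pairing, assuming the
 * standard libxpmem user API. Not the actual application code. */
#include <xpmem.h>
#include <stddef.h>

void *map_peer_buffer(xpmem_segid_t segid, size_t size, xpmem_apid_t *apid_out)
{
    struct xpmem_addr addr;

    /* Obtain an access permit for the peer's exported segment. */
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, NULL);
    if (apid == -1)
        return NULL;

    addr.apid   = apid;
    addr.offset = 0;

    /* Map the peer's memory into our own address space. */
    void *ptr = xpmem_attach(addr, size, NULL);
    if (ptr == (void *)-1) {
        xpmem_release(apid);
        return NULL;
    }

    *apid_out = apid;
    return ptr;
}

void unmap_peer_buffer(void *ptr, xpmem_apid_t apid)
{
    /* Every xpmem_attach is paired with an xpmem_detach, and the
     * access permit is released afterwards. */
    xpmem_detach(ptr);
    xpmem_release(apid);
}
```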

Does anyone have any ideas as to what might be the problem here?

Thanks a lot!

@angainor
Author

@hjelmn not sure if this is important, but I've noticed that I get the deadlock less often when I don't call xpmem_remove explicitly in my code. This makes me wonder: is there a possible cleanup problem when the publisher calls xpmem_remove on a region that peers are still attached to? In other words, the publisher calls xpmem_remove, and only then does the peer call xpmem_detach and xpmem_release.
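
To make the ordering concrete, the interleaving I'm asking about is roughly the following (again an illustrative sketch, not the actual application code; the peer side is the map/unmap pattern from the first comment):

```c
/* Illustrative publisher-side sketch of the ordering in question
 * (not the actual application code; error handling omitted). */
#include <xpmem.h>
#include <stddef.h>

/* Step 1: the publisher exports its buffer. */
xpmem_segid_t publish_buffer(void *buf, size_t len)
{
    return xpmem_make(buf, len, XPMEM_PERMIT_MODE, (void *)0666);
}

/* Step 3: the publisher removes the segment, possibly while peers
 * still hold attachments from step 2 (xpmem_get + xpmem_attach).
 * Only afterwards, in step 4, does the peer call xpmem_detach and
 * xpmem_release. The question is whether xpmem's cleanup handles
 * this remove-before-detach ordering correctly. */
void unpublish_buffer(xpmem_segid_t segid)
{
    xpmem_remove(segid);
}
```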

@cvmeq

cvmeq commented Jun 14, 2023

@angainor have you found a solution or root cause for this? We are experiencing very similar hangs resulting in zombie/defunct processes on our AMD cluster running RHEL 8.6 and MLNX_OFED_LINUX-5.8-1.0.1.1:

[Wed Jun 14 17:28:43 2023] Call Trace:
[Wed Jun 14 17:28:43 2023]  __schedule+0x2d1/0x840
[Wed Jun 14 17:28:43 2023]  schedule+0x35/0xa0
[Wed Jun 14 17:28:43 2023]  schedule_timeout+0x278/0x300
[Wed Jun 14 17:28:43 2023]  ? number+0x324/0x360
[Wed Jun 14 17:28:43 2023]  ? get_futex_key+0x98/0x3e0
[Wed Jun 14 17:28:43 2023]  wait_for_completion+0x96/0x100
[Wed Jun 14 17:28:43 2023]  __synchronize_srcu.part.17+0x83/0xb0
[Wed Jun 14 17:28:43 2023]  ? __bpf_trace_rcu_utilization+0x10/0x10
[Wed Jun 14 17:28:43 2023]  ? synchronize_srcu+0xad/0xf0
[Wed Jun 14 17:28:43 2023]  mmu_notifier_unregister+0xa6/0xe0
[Wed Jun 14 17:28:43 2023]  xpmem_flush+0x14a/0x170 [xpmem]
[Wed Jun 14 17:28:43 2023]  filp_close+0x31/0x70
[Wed Jun 14 17:28:43 2023]  put_files_struct+0x70/0xc0
[Wed Jun 14 17:28:43 2023]  do_exit+0x32f/0xb10
[Wed Jun 14 17:28:43 2023]  do_group_exit+0x3a/0xa0
[Wed Jun 14 17:28:43 2023]  get_signal+0x158/0x870
[Wed Jun 14 17:28:43 2023]  do_signal+0x36/0x690
[Wed Jun 14 17:28:43 2023]  ? do_send_sig_info+0x63/0x90
[Wed Jun 14 17:28:43 2023]  ? recalc_sigpending+0x17/0x60
[Wed Jun 14 17:28:43 2023]  exit_to_usermode_loop+0x89/0x100
[Wed Jun 14 17:28:43 2023]  do_syscall_64+0x19c/0x1b0
[Wed Jun 14 17:28:43 2023]  entry_SYSCALL_64_after_hwframe+0x61/0xc6

@angainor
Author

angainor commented Jun 27, 2023

@cvmeq Unfortunately no, I still see those issues sometimes, mostly when you kill / interrupt a large job, or at job cleanup. The only solution for me was not to use the xpmem transport in OpenMPI / UCX.

tzafrir-mellanox pushed a commit to tzafrir-mellanox/xpmem that referenced this issue Sep 11, 2024
KERNEL: Also support kernel 6.5+