PBS_MOM killed on job exit (xpmem_close_handler) #28
Comments
Ok, will take a look.
tzafrir-mellanox pushed a commit to tzafrir-mellanox/xpmem that referenced this issue on Sep 11, 2024: Compilation fix for ppc on Ubuntu18.04
xpmem_close_handler forces a SIGKILL on the current thread group. In certain cases that thread group is PBS_MOM, so PBS_MOM itself gets killed. Our initial guess is a race to free memory while the user job's processes are exiting, in which the appropriate xpmem_detach does not win.
Original stack trace, recovered via SystemTap:
0xffffffffa1170e57 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x8e57/0x0]
0xffffffffa117297e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xa97e/0x0]
0xffffffffa1172f8e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xaf8e/0x0]
0xffffffffa1174295 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xc295/0x0]
0xffffffffa116801d [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x1d/0x0]
0xffffffff810932b5 : __send_signal+0x245/0x450 [kernel]
0xffffffff8101bfe4 : try_stack_unwind+0x194/0x1b0 [kernel]
0xffffffff8101ae04 : dump_trace+0x64/0x3b0 [kernel]
0xffffffffa1172e88 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xae88/0x0]
0xffffffffa1172f8e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xaf8e/0x0]
0xffffffffa1174295 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xc295/0x0]
0xffffffffa116801d [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x1d/0x0]
0xffffffff810932b5 : __send_signal+0x245/0x450 [kernel]
0xffffffff810934fe : send_signal+0x3e/0x80 [kernel]
0xffffffff81093d30 : force_sig_info+0xb0/0xe0 [kernel]
0xffffffff81093d76 : force_sig+0x16/0x20 [kernel]
0xffffffffa0238a01 : xpmem_close_handler+0x151/0x270 [xpmem]
0xffffffff811d774d : remove_vma+0x2d/0x70 [kernel]
0xffffffff811db09a : exit_mmap+0xea/0x150 [kernel]
0xffffffff81082edf : mmput+0x4f/0x110 [kernel]
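When reading the trace, it can help to skip the frames that have no symbol name (the stap_... entries, which appear to come from the SystemTap probe module itself) and list only the symbolized frames. The following is a minimal sketch that assumes the line format shown above; FRAME_RE and symbol_frames are hypothetical helper names, not part of xpmem or SystemTap.

```python
import re

# Matches frames that carry a symbol, e.g.
# "0xffffffff811d774d : remove_vma+0x2d/0x70 [kernel]"
# Frames without " : symbol" (the stap_... entries) do not match.
FRAME_RE = re.compile(
    r"^(0x[0-9a-f]+) : (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]$"
)


def symbol_frames(trace_text):
    """Return (symbol, module) pairs for frames that have a symbol name."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group(2), match.group(3)))
    return frames


# The last frames of the trace above, copied verbatim.
TRACE = """\
0xffffffffa116801d [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x1d/0x0]
0xffffffff810932b5 : __send_signal+0x245/0x450 [kernel]
0xffffffff810934fe : send_signal+0x3e/0x80 [kernel]
0xffffffff81093d30 : force_sig_info+0xb0/0xe0 [kernel]
0xffffffff81093d76 : force_sig+0x16/0x20 [kernel]
0xffffffffa0238a01 : xpmem_close_handler+0x151/0x270 [xpmem]
0xffffffff811d774d : remove_vma+0x2d/0x70 [kernel]
0xffffffff811db09a : exit_mmap+0xea/0x150 [kernel]
0xffffffff81082edf : mmput+0x4f/0x110 [kernel]
"""

# The trace lists the innermost frame first, so reverse it to
# print the call path from the outermost caller inward.
for symbol, module in reversed(symbol_frames(TRACE)):
    print(f"{symbol} [{module}]")
```

Read this way, the symbolized part of the path runs from mmput through exit_mmap and remove_vma into xpmem_close_handler, which then calls force_sig.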
We enabled xpmem_debug and captured the following trace along with the dmesg log.
The job was started at 16:30, so you can extract the relevant log entries with: grep "Jun 11 16:3" r1i6n18.gbe.ice.issp.u-tokyo.ac.jp
20180611.tar.gz
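For anyone who prefers to filter the log from a script, the grep above can be approximated in Python. This is only a sketch: lines_containing is a hypothetical helper, and the sample lines are made-up placeholders, not entries from the attached log.

```python
def lines_containing(lines, needle):
    """Yield lines that contain the given substring, like a fixed-string grep."""
    for line in lines:
        if needle in line:
            yield line


# Placeholder lines for illustration only; real input would be read
# from the log file extracted from the attachment.
SAMPLE = [
    "Jun 11 16:29:58 placeholder entry before the job",
    "Jun 11 16:30:02 placeholder entry during the job",
    "Jun 11 16:41:10 placeholder entry after the window",
]

# "Jun 11 16:3" covers the ten minutes from 16:30 to 16:39.
for line in lines_containing(SAMPLE, "Jun 11 16:3"):
    print(line)
```

To run it on a real file, pass an open file object instead of SAMPLE, for example lines_containing(open(path), "Jun 11 16:3"), where path is wherever the log was extracted.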