PBS_MOM killed on job exit (xpmem_close_handler) #28

Open
deffjammer opened this issue Jun 12, 2018 · 1 comment

deffjammer commented Jun 12, 2018

xpmem_close_handler forces a SIGKILL on the current thread group. In certain cases that means it kills the PBS_MOM. Our initial guess is a race to free memory while the user job's processes are exiting, in which the appropriate xpmem_detach does not win the race.

Original stack trace, recovered via SystemTap:

0xffffffffa1170e57 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x8e57/0x0]
0xffffffffa117297e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xa97e/0x0]
0xffffffffa1172f8e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xaf8e/0x0]
0xffffffffa1174295 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xc295/0x0]
0xffffffffa116801d [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x1d/0x0]
0xffffffff810932b5 : __send_signal+0x245/0x450 [kernel]
0xffffffff8101bfe4 : try_stack_unwind+0x194/0x1b0 [kernel]
0xffffffff8101ae04 : dump_trace+0x64/0x3b0 [kernel]
0xffffffffa1172e88 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xae88/0x0]
0xffffffffa1172f8e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xaf8e/0x0]
0xffffffffa1174295 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xc295/0x0]
0xffffffffa116801d [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x1d/0x0]
0xffffffff810932b5 : __send_signal+0x245/0x450 [kernel]
0xffffffff810934fe : send_signal+0x3e/0x80 [kernel]
0xffffffff81093d30 : force_sig_info+0xb0/0xe0 [kernel]
0xffffffff81093d76 : force_sig+0x16/0x20 [kernel]
0xffffffffa0238a01 : xpmem_close_handler+0x151/0x270 [xpmem]
0xffffffff811d774d : remove_vma+0x2d/0x70 [kernel]
0xffffffff811db09a : exit_mmap+0xea/0x150 [kernel]
0xffffffff81082edf : mmput+0x4f/0x110 [kernel]
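
To make the call path in the trace easier to read, the following is a minimal Python sketch that keeps only the symbolized frames and prints them from outermost caller to innermost callee. It is not part of xpmem or SystemTap; the names `FRAME_RE`, `symbolized_frames`, and `TRACE_TAIL` are hypothetical helpers introduced for illustration, and the sample input is the last nine lines of the trace above, copied verbatim.

```python
import re

# Matches symbolized frames of the form seen in the trace, e.g.:
#   0xffffffffa0238a01 : xpmem_close_handler+0x151/0x270 [xpmem]
# Unsymbolized SystemTap module frames ("0x... [stap_...]") do not match.
FRAME_RE = re.compile(
    r"^(0x[0-9a-f]+) : ([A-Za-z_][\w.]*)\+(0x[0-9a-f]+)/(0x[0-9a-f]+) \[(\w+)\]$"
)


def symbolized_frames(trace_text):
    """Return (symbol, module) pairs for every symbolized frame, in trace order."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            _addr, symbol, _offset, _size, module = match.groups()
            frames.append((symbol, module))
    return frames


# Tail of the trace from this issue, copied verbatim.
TRACE_TAIL = """\
0xffffffffa116801d [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x1d/0x0]
0xffffffff810932b5 : __send_signal+0x245/0x450 [kernel]
0xffffffff810934fe : send_signal+0x3e/0x80 [kernel]
0xffffffff81093d30 : force_sig_info+0xb0/0xe0 [kernel]
0xffffffff81093d76 : force_sig+0x16/0x20 [kernel]
0xffffffffa0238a01 : xpmem_close_handler+0x151/0x270 [xpmem]
0xffffffff811d774d : remove_vma+0x2d/0x70 [kernel]
0xffffffff811db09a : exit_mmap+0xea/0x150 [kernel]
0xffffffff81082edf : mmput+0x4f/0x110 [kernel]
"""

if __name__ == "__main__":
    # The trace lists the innermost frame first, so reverse it to read
    # the path from caller to callee.
    for symbol, module in reversed(symbolized_frames(TRACE_TAIL)):
        print(f"{symbol} [{module}]")
```

Read in that order, the frames show the path mmput, exit_mmap, remove_vma, xpmem_close_handler, force_sig: the signal is raised from the VMA close path while the exiting process's address space is being torn down.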

We enabled xpmem_debug and captured the following trace along with the dmesg log.

The job was started at 16:30, so you can extract the relevant log entries with: grep "Jun 11 16:3" r1i6n18.gbe.ice.issp.u-tokyo.ac.jp

20180611.tar.gz

hjelmn (Collaborator) commented Sep 26, 2019

Ok, will take a look.

hjelmn self-assigned this Sep 26, 2019
tzafrir-mellanox pushed a commit to tzafrir-mellanox/xpmem that referenced this issue Sep 11, 2024
Compilation fix for ppc on Ubuntu18.04