Segfault when using XGBClassifier with n_jobs >= 74780 #10869

Closed
dakaidan opened this issue Oct 1, 2024 · 2 comments · Fixed by #10872

dakaidan commented Oct 1, 2024

What is the issue

When calling the fit method of XGBClassifier on any data with n_jobs set to a value greater than 74780, a segfault occurs.
The exact threshold is likely platform-specific; 2^17 crashes consistently across machines, so it is used in the example.

Steps to reproduce

This error occurred with the Python package, version 2.1.1. The full set of installed Python packages is:

joblib==1.4.2
numpy==2.1.1
nvidia-nccl-cu12==2.23.4
scikit-learn==1.5.2
scipy==1.14.1
threadpoolctl==3.5.0
xgboost==2.1.1

The crash persists across Python 3.8 through 3.13.

This was run on Ubuntu 22.04.5 LTS, on AMD EPYC-Milan with 32GB Memory.

The code is:

import random
from xgboost import XGBClassifier

random.seed(0)

model7_Variable = XGBClassifier(n_jobs=2**17)
model7_Variable.fit([[1, 2], [3, 4]], [1, 0])

Running this:

> python3.13 segfault_xgboost.py 
Segmentation fault (core dumped)
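
To tell a real segfault apart from an ordinary Python error without taking down the calling process, the reproduction can be run in a child interpreter and its exit status inspected. The following is a minimal sketch, not part of the original report: the helper names `run_child` and `describe` and the `REPRO` constant are made up for illustration, and it assumes a POSIX system, where `subprocess` reports death by signal as a negative return code.

```python
import signal
import subprocess
import sys

# The reproduction from this issue, kept as a string so that it only ever
# runs inside a child interpreter (hypothetical constant name).
REPRO = (
    "from xgboost import XGBClassifier\n"
    "XGBClassifier(n_jobs=2**17).fit([[1, 2], [3, 4]], [1, 0])\n"
)


def run_child(code: str) -> int:
    """Run `code` in a fresh Python interpreter and return its exit status."""
    return subprocess.run([sys.executable, "-c", code]).returncode


def describe(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # On POSIX, a negative value -N means the child was killed by signal N.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"
```

Calling `describe(run_child(REPRO))` on an affected setup should report a kill by SIGSEGV, while a Python-level failure such as a missing package shows up as a non-zero exit status instead.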

Running it under gdb to obtain the backtrace at the segfault gives:

(gdb) backtrace
#0  0x00007fff9181b999 in gomp_team_start ()
   from /home/ubuntu/.local/lib/python3.13/site-packages/xgboost/lib/../../xgboost.libs/libgomp-24e2ab19.so.1.0.0
#1  0x00007fff91811721 in GOMP_parallel ()
   from /home/ubuntu/.local/lib/python3.13/site-packages/xgboost/lib/../../xgboost.libs/libgomp-24e2ab19.so.1.0.0
#2  0x00007fff921175b6 in decltype(auto) xgboost::data::HostAdapterDispatch<true, xgboost::data::IterativeDMatrix::InitFromCPU(xgboost::Context const*, xgboost::BatchParam const&, void*, float, std::shared_ptr<xgboost::DMatrix>)::{lambda()#3}::operator()() const::{lambda(auto:1 const&)#1}>(xgboost::data::DMatrixProxy const*, xgboost::data::IterativeDMatrix::InitFromCPU(xgboost::Context const*, xgboost::BatchParam const&, void*, float, std::shared_ptr<xgboost::DMatrix>)::{lambda()#3}::operator()() const::{lambda(auto:1 const&)#1}, bool*) [clone .constprop.0] ()
   from /home/ubuntu/.local/lib/python3.13/site-packages/xgboost/lib/libxgboost.so
#3  0x00007fff9211a396 in xgboost::data::IterativeDMatrix::InitFromCPU(xgboost::Context const*, xgboost::BatchParam const&, void*, float, std::shared_ptr<xgboost::DMatrix>) ()
   from /home/ubuntu/.local/lib/python3.13/site-packages/xgboost/lib/libxgboost.so
#4  0x00007fff9211d82c in xgboost::data::IterativeDMatrix::IterativeDMatrix(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int) ()
   from /home/ubuntu/.local/lib/python3.13/site-packages/xgboost/lib/libxgboost.so
#5  0x00007fff920cce3a in xgboost::DMatrix* xgboost::DMatrix::Create<void*, void*, void (void*), int (void*)>(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int)
    () from /home/ubuntu/.local/lib/python3.13/site-packages/xgboost/lib/libxgboost.so
#6  0x00007fff91d4546c in XGQuantileDMatrixCreateFromCallback ()
   from /home/ubuntu/.local/lib/python3.13/site-packages/xgboost/lib/libxgboost.so
#7  0x00007ffff7696e2e in ?? () from /lib/x86_64-linux-gnu/libffi.so.8
#8  0x00007ffff7693493 in ?? () from /lib/x86_64-linux-gnu/libffi.so.8
#9  0x00007ffff76bbc63 in _call_function_pointer (argtypecount=<optimized out>, argcount=7, 
    resmem=0x7fffffffd770, restype=<optimized out>, atypes=<optimized out>, avalues=<optimized out>, 
    pProc=0x7fff91d452e0 <XGQuantileDMatrixCreateFromCallback>, flags=<optimized out>, 
    st=0x7ffff776af50) at ./Modules/_ctypes/callproc.c:950
#10 _ctypes_callproc (st=st@entry=0x7ffff776af50, pProc=<optimized out>, 
    argtuple=argtuple@entry=0x7ffff7761fc0, flags=<optimized out>, argtypes=<optimized out>, 
    restype=<optimized out>, checker=<optimized out>) at ./Modules/_ctypes/callproc.c:1300
#11 0x00007ffff76b4fba in PyCFuncPtr_call (self=self@entry=0x7fff9fa16390, 
    inargs=inargs@entry=0x7ffff7761fc0, kwds=kwds@entry=0x0) at ./Modules/_ctypes/_ctypes.c:4348
#12 0x00005555556fb77e in _PyObject_MakeTpCall (tstate=0x555555a92c60 <_PyRuntime+282976>, 
    callable=0x7fff9fa16390, args=0x7ffff7fb0820, nargs=7, keywords=0x0) at Objects/call.c:242
#13 0x000055555576415f in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, 
    throwflag=<optimized out>) at Python/generated_cases.c.h:813
#14 0x00005555556fc767 in _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, 
    stack=0x7fff9f8bad38, func=0x7fff9fa88860) at Objects/call.c:413
#15 _PyObject_VectorcallDictTstate (kwargs=<optimized out>, nargsf=<optimized out>, 
    args=0x7fffffffdc80, callable=0x7fff9fa88860, tstate=0x555555a92c60 <_PyRuntime+282976>)
    at Objects/call.c:146
#16 _PyObject_Call_Prepend (tstate=tstate@entry=0x555555a92c60 <_PyRuntime+282976>, 
    callable=callable@entry=0x7fff9fa88860, obj=obj@entry=0x7fff9fa99550, 
    args=args@entry=0x555555a633b8 <_PyRuntime+88248>, kwargs=<optimized out>) at Objects/call.c:504
#17 0x0000555555739e4c in slot_tp_init (self=0x7fff9fa99550, args=0x555555a633b8 <_PyRuntime+88248>, 
    kwds=<optimized out>) at Objects/typeobject.c:9780
#18 0x0000555555736995 in type_call (self=self@entry=0x555556f6dc50, 
    args=args@entry=0x555555a633b8 <_PyRuntime+88248>, kwds=0x7fffa0a41e40) at Objects/typeobject.c:1990
#19 0x00005555556fce7b in _PyObject_Call (kwargs=<optimized out>, 
    args=0x555555a633b8 <_PyRuntime+88248>, callable=0x555556f6dc50, 
    tstate=0x555555a92c60 <_PyRuntime+282976>) at Objects/call.c:361
#20 PyObject_Call (callable=0x555556f6dc50, args=0x555555a633b8 <_PyRuntime+88248>, 
    kwargs=<optimized out>) at Objects/call.c:373
#21 0x000055555576547a in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, 
    throwflag=<optimized out>) at Python/generated_cases.c.h:1355
#22 0x00005555557f9c09 in PyEval_EvalCode (co=co@entry=0x7ffff78928e0, 
    globals=globals@entry=0x7ffff772c0c0, locals=locals@entry=0x7ffff772c0c0) at Python/ceval.c:596
#23 0x0000555555818bf2 in run_eval_code_obj (tstate=0x555555a92c60 <_PyRuntime+282976>, 
    , globals=0x7ffff772c0c0, locals=0x7ffff772c0c0) at Python/pythonrun.c:1323
#24 0x0000555555818a4e in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7ffff772c0c0, locals=0x7ffff772c0c0, flags=<optimized out>, arena=<optimized out>, interactive_src=0x0, 
    generate_new_source=0) at Python/pythonrun.c:1408
#25 0x00005555558193b4 in pyrun_file (fp=fp@entry=0x555555b062e0, filename=filename@entry=0x7ffff7761a70, start=start@entry=257, globals=globals@entry=0x7ffff772c0c0, locals=locals@entry=0x7ffff772c0c0, 
    closeit=closeit@entry=1, flags=0x7fffffffe208) at Python/pythonrun.c:1241
#26 0x0000555555818f11 in _PyRun_SimpleFileObject (fp=0x555555b062e0, filename=0x7ffff7761a70, closeit=1, flags=0x7fffffffe208) at Python/pythonrun.c:490
#27 0x0000555555818d37 in _PyRun_AnyFileObject (fp=0x555555b062e0, filename=filename@entry=0x7ffff7761a70, closeit=closeit@entry=1, flags=flags@entry=0x7fffffffe208) at Python/pythonrun.c:77
#28 0x0000555555822577 in pymain_run_file_obj (skip_source_first_line=0, filename=0x7ffff7761a70, program_name=0x7ffff77168d0) at Modules/main.c:409
#29 pymain_run_file (config=0x555555a65358 <_PyRuntime+96344>) at Modules/main.c:428
#30 pymain_run_python (exitcode=0x7fffffffe1fc) at Modules/main.c:696
#31 Py_RunMain () at Modules/main.c:775
#32 0x000055555582207d in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:829
#33 0x00007ffff7cc5d90 in __libc_start_call_main (main=main@entry=0x5555556d8420 <main>, argc=argc@entry=2, argv=argv@entry=0x7fffffffe458) at ../sysdeps/nptl/libc_start_call_main.h:58
#34 0x00007ffff7cc5e40 in __libc_start_main_impl (main=0x5555556d8420 <main>, argc=2, argv=0x7fffffffe458, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe448)
    at ../csu/libc-start.c:392
#35 0x000055555579a575 in _start ()

Expected behaviour

An error reporting the failed resource acquisition, as happens with smaller values:

libgomp: Thread creation failed: Resource temporarily unavailable
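
Until the library rejects such values itself, a caller can avoid the crash by capping the requested thread count before building the estimator. This is only a caller-side sketch and not a description of the fix in #10872. The helper name `clamp_n_jobs` is hypothetical, and treating a non-positive request as "use every available core" is an assumption of this sketch.

```python
import os
from typing import Optional


def clamp_n_jobs(requested: int, available: Optional[int] = None) -> int:
    """Cap a requested thread count at the number of available cores.

    `available` defaults to os.cpu_count(); it is a parameter so the
    function can be exercised without depending on the host machine.
    """
    if available is None:
        available = os.cpu_count() or 1  # cpu_count() may return None
    if requested <= 0:
        # Assumption of this sketch: non-positive means "use all cores".
        return available
    return min(requested, available)
```

With this helper, the example above would pass `n_jobs=clamp_n_jobs(2**17)` to XGBClassifier, which never asks for more threads than the machine has cores.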
@trivialfis
Member

This seems like an issue in gomp instead?

@trivialfis
Member

Hi, I opened a PR with the fix: #10872.
