I'm having issues getting the GPU acceleration to work. Using an RX 7900 XTX on Fedora Workstation 40, Kernel 6.9.9-200.fc40.x86_64.
I built the library from both master-rocm and v2.0.3-rocm using the instructions in README-ROCm.md (so mainly running cmake -DUSE_HIP=1 ../ and make) with the same result.
Using demo/CLI/binary_classification as a reproduction case, it runs normally in its default configuration, but adding device = "gpu" to mushroom.conf leads to the following crash. Other reproduction cases show similar behavior, but with different symbols depending on the usage:
:0:/builddir/build/BUILD/clr-rocm-6.0.2/hipamd/src/hip_global.cpp:114 : 6326756172 us: [pid:35900 tid:0x7fc3aa1142c0] Cannot find Symbol with name: _ZN7xgboost4tree20EvaluateSplitsKernelILi64EEEvjNS_6common4SpanIKNS0_19EvaluateSplitInputsELm18446744073709551615EEENS0_25EvaluateSplitSharedInputsENS3_IjLm18446744073709551615EEENS0_13TreeEvaluator14SplitEvaluatorINS0_16GPUTrainingParamEEENS3_INS0_20DeviceSplitCandidateELm18446744073709551615EEE
./runexp.sh: line 10: 35900 Aborted (core dumped) $XGBOOST mushroom.conf
Unfortunately I'm unfamiliar with the inner workings of GPU acceleration, but I have verified that the .so file contains the symbol:
$ nm -gD ../../../lib/libxgboost.so | grep _ZN7xgboost4tree20EvaluateSplitsKernelILi64EEEvjNS_6common4SpanIKNS0_19EvaluateSplitInputsELm18446744073709551615EEENS0_25EvaluateSplitSharedInputsENS3_IjLm18446744073709551615EEENS0_13TreeEvaluator14SplitEvaluatorINS0_16GPUTrainingParamEEENS3_INS0_20DeviceSplitCandidateELm18446744073709551615EEE
0000000002186038 V _ZN7xgboost4tree20EvaluateSplitsKernelILi64EEEvjNS_6common4SpanIKNS0_19EvaluateSplitInputsELm18446744073709551615EEENS0_25EvaluateSplitSharedInputsENS3_IjLm18446744073709551615EEENS0_13TreeEvaluator14SplitEvaluatorINS0_16GPUTrainingParamEEENS3_INS0_20DeviceSplitCandidateELm18446744073709551615EEE
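A note on reading that nm line: the type letter is V, which GNU nm uses for a weak object, not T for code in the text section. As far as I understand HIP-Clang builds (an assumption, not something confirmed in this thread), the host-side symbol carrying the kernel's name is a handle used at launch time, while the device code is embedded separately per GPU architecture. If that holds, finding the name with nm would not by itself show that a code object for this GPU contains the kernel. The sketch below only decodes the type letter from an nm output line; the function name and the sample symbol name are made up for illustration.

```python
# Meanings of a few symbol type letters, as documented in the GNU nm manual.
NM_TYPES = {
    "T": "code in the text section",
    "D": "initialized data",
    "B": "uninitialized data (BSS)",
    "U": "undefined (resolved from another library)",
    "V": "weak object (data, not code)",
    "W": "weak symbol not tagged as an object",
}


def parse_nm_line(line):
    """Split one line of `nm -gD` output into address, type letter and name.

    Defined symbols have three fields; undefined ones have no address.
    """
    parts = line.split()
    if len(parts) == 3:
        address, sym_type, name = parts
    elif len(parts) == 2:
        address = None
        sym_type, name = parts
    else:
        raise ValueError(f"unexpected nm line: {line!r}")
    return {
        "address": address,
        "type": sym_type,
        "meaning": NM_TYPES.get(sym_type.upper(), "other"),
        "name": name,
    }


if __name__ == "__main__":
    # Hypothetical short name standing in for the long mangled kernel symbol.
    entry = parse_nm_line("0000000002186038 V example_kernel_symbol")
    print(f"{entry['type']}: {entry['meaning']}")
```

Feeding the real line from above through parse_nm_line gives the same type letter, so the check confirms that a host-side object with that name exists, and nothing more.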
In general, calls to the library seem to work (see attached logs for ltrace -e "hip*" and a run with AMD_LOG_LEVEL=3). Additionally I attached the output of rocminfo and a gdb backtrace of the crash.
Sorry @jakobwinkler, I only just noticed this issue. I don't have access to Radeon GPUs, which have a different architecture: for example, they use wave32, while data center GPUs use wave64.
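To make the wave32 versus wave64 difference concrete, here is a small CPU-side model of a shuffle-down style reduction. It is only an illustration of why code written for 64-wide wavefronts can need changes on 32-wide hardware; it is not XGBoost's kernel code, and wave_reduce is a made-up name. Whether this difference is related to the missing-symbol crash above is not established in this thread.

```python
def wave_reduce(values, wave_size):
    """Sum `values` the way a per-wavefront tree reduction would.

    The input is cut into wavefronts of `wave_size` lanes. Inside each
    wavefront, lane i adds the value held by lane i + offset, with the
    offset halving every step. Lanes without a partner are left unchanged
    in this model; only lane 0's final value is used.
    Returns one partial sum per wavefront.
    """
    sums = []
    for start in range(0, len(values), wave_size):
        lanes = list(values[start:start + wave_size])
        lanes += [0] * (wave_size - len(lanes))  # pad a short last wavefront
        offset = wave_size // 2
        while offset > 0:
            lanes = [
                lanes[i] + (lanes[i + offset] if i + offset < wave_size else 0)
                for i in range(wave_size)
            ]
            offset //= 2
        sums.append(lanes[0])
    return sums


if __name__ == "__main__":
    block = list(range(1, 65))  # a block of 64 threads holding 1..64
    print(wave_reduce(block, 64))  # one wavefront covers the whole block
    print(wave_reduce(block, 32))  # the same block spans two wavefronts
```

With 64 lanes per wavefront, a 64-thread block finishes in a single wavefront and lane 0 holds the total. With 32 lanes, the same block yields two partial sums that still have to be combined, for example through shared memory, which is the kind of assumption that has to be revisited when porting between the two.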
Problem Description
Hi,
I'm having issues getting the GPU acceleration to work. Using an RX 7900 XTX on Fedora Workstation 40, Kernel 6.9.9-200.fc40.x86_64.
I built the library from both master-rocm and v2.0.3-rocm using the instructions in README-ROCm.md (so mainly running cmake -DUSE_HIP=1 ../ and make) with the same result.
Using demo/CLI/binary_classification as a reproduction case, it runs normally in its default configuration, but adding device = "gpu" to mushroom.conf makes the run abort with a "Cannot find Symbol with name" error. Other reproduction cases show similar behavior, but with different symbols depending on the usage.
Unfortunately I'm unfamiliar with the inner workings of GPU acceleration, but I have verified that the .so file contains the symbol.
In general, calls to the library seem to work (see attached logs for ltrace -e "hip*" and a run with AMD_LOG_LEVEL=3). Additionally, I attached the output of rocminfo and a gdb backtrace of the crash.
Any pointers would be very welcome.
gdb.log
loglevel3.log
ltrace.log
rocminfo.log
Operating System
Fedora Linux 40 (Workstation Edition) x86_64
CPU
13th Gen Intel(R) Core(TM) i5-13600KF
GPU
AMD Radeon RX 7900 XTX
ROCm Version
ROCm 6.0.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
Additional Information
No response